IDA Framework

The IDA framework consists of six steps [Huebner et al 2018, Figure 1]. Here we assume that metadata (step I) exist in sufficient detail and that data cleaning (step II) has already been performed. Metadata summarize the background information about the data that is needed to conduct the IDA steps properly, and the data cleaning process identifies and corrects technical errors. Data screening (step III) examines data properties to inform decisions about the intended analysis. Initial data reporting (step IV) documents the insights gained in the previous steps and can be referred to when interpreting results from the regression modeling. A consequence of these steps can be that the analysis plan needs to be refined or updated (step V). Finally, reporting of IDA results in research papers (step VI) is necessary to ensure transparency regarding key findings that influence the analysis or the interpretation of results. Further details about the elements of IDA are discussed in [TG3 papers].

IDA framework

References

Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.

Huebner M, Vach W, le Cessie S, Schmidt C, Lusa L. Hidden Analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Med Res Meth 2020; 20: 61.

Scope of the regression analyses for the examples

Regression models can be used for a wide range of purposes. For these examples, the assumptions about the regression analysis set-up are listed in Table 1, so that IDA tasks can be explained in a well-defined, practically relevant setting. Since a key principle is that IDA does not touch the research question, no associations between dependent (outcome) and independent (non-outcome) variables are considered.

Table 1: The scope of the regression analyses considered for IDA tasks

Aspects of the research plan | Assumptions in this paper | Reason for the assumption
Dependent (outcome) variable | One dependent variable that can be continuous or binary; exclude time-to-event or longitudinal outcomes | Explain IDA tasks in a well-defined, practically relevant setting
Regression models | Models with linear predictors | Explain IDA tasks in a well-defined, practically relevant setting
Purpose of regression model | Adjust effect of one variable of interest for confounders; quantify the effects of explanatory variables on the outcome | Explain IDA tasks in a well-defined, practically relevant setting
Independent variables | “Explanatory” or “confounder” depending on purpose of model; small to moderate number of mixed types; not high dimensional; no repeated measurements | To demonstrate IDA approaches for a mix of variables likely to be encountered in practice
Statistical analysis plan | Exists, defines the outcome variable, the type of regression model to be used, and a set of independent variables | IDA does not touch the research question, but may lead to an update or refinement of the analysis plan

References:

Vach W. Regression Models as a Tool in Medical Research. Chapman/Hall CRC 2012

Harrell FE. Regression Modeling Strategies. Springer (2nd ed) 2015

Royston P and Sauerbrei W. Multivariable Model Building. Wiley (2008)

[…]

Data screening and possible actions

Univariate distributions

Variable type | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation
Continuous variables | General skewness | Help in interpreting results | Update SAP | Update intended presentation of results
Continuous variables | General skewness | Wide CI for coefficients | Use variable as log-transformed | Update intended presentation of results
Continuous variables | Outliers | Disproportional impact on results | Winsorize or transform | Model involves winsorization
Continuous variables | Spike at 0 | Narrow CI at 0 | Use appropriate representation of variable in model | Use 2 (or more) coefficients to distinguish 0 from non-0 continuous part
Categorical variables | Frequencies | Comparisons to default reference probably irrelevant | Change reference category | Contrasts compare to (new) reference category
Categorical variables | Rare categories | Wide CI for coefficients | Collapse/exclude | Fewer categories to present
Categorical variables | One very frequent category | Comparisons irrelevant? | Exclude variable | Variable omitted
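
Winsorization, listed above as a possible action for outliers, can be illustrated with a minimal base R sketch. The helper name `winsorize`, the 1% and 99% cut-offs, and the simulated values are assumptions made for illustration only; they are not taken from the analysis plan.

```r
# Winsorize a numeric vector at chosen lower/upper quantiles.
# The 1% / 99% cut-offs are an illustrative assumption, not a recommendation.
winsorize <- function(x, probs = c(0.01, 0.99)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])  # NAs stay NA
}

set.seed(1)
x <- c(rnorm(200, mean = 100, sd = 15), 400)  # simulated values plus one gross outlier
x_w <- winsorize(x)

range(x)    # includes the outlier
range(x_w)  # extreme values pulled in to the chosen quantiles
```

Whether winsorizing or a transformation is preferable depends on the variable and on the analysis plan; the sketch only shows the mechanics.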

Bivariate distributions

Variable combination | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation
Continuous by continuous | Outliers (from the cloud) | Disproportional impact on results | Winsorize or transform | Model involves winsorization
Continuous by continuous | Correlations | Wide CI for coefficients | Winsorize or transform | Model involves winsorization
Continuous by categorical | Outliers (only visible in bivariate plot) | Wide CI for coefficients |  | 
Categorical by categorical | Frequent/rare combinations | Comparison to default reference irrelevant | Change reference category | Contrasts compare to (new) reference category
Categorical by categorical | Frequent/rare combinations | Interactions relevant? | Remove interaction from model | Fewer interactions to present

Missing values

Aspect | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation
Per variable | Number and proportion | Wide CI for coefficients | Remove variable if many missing values | 
Pattern | Variables missing independently or together |  | Omit variables together | Changes model
Pattern | Variables missing dependent on levels of other variables | Systematic missingness? Model still based on representative? | IPW needed? | Weighted analysis
Complete cases | Number and proportion | Few cases left for main CCO analysis | Multiple imputation (or other way of dealing with missing values)? | Result from MI analysis? Or applicability restricted to a subpopulation?

[…]

CRASH-2

Introduction to CRASH-2

Since a key principle of IDA is not to touch the research questions, before IDA commences the research aim and statistical analysis plan need to be in place. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan, which is described in more detail in the section Crash2_SAP.Rmd.

The hypothetical research aim for IDA is to develop a multivariable model for early death (death within 28 days from injury) using nine independent variables of mixed type (continuous, categorical, semicontinuous), with the primary aim of prediction and a secondary aim of describing the association of each variable with the outcome.

A prediction model was developed and validated on this data set in “Predicting early death in patients with traumatic bleeding”, Perel et al, BMJ 2012, [supplement available at]. The assumed research aim is in line with that prediction model.

CRASH-2 Description

Clinical Randomisation of an Antifibrinolytic in Significant Haemorrhage (CRASH-2) was a large randomised placebo-controlled trial among trauma patients with, or at risk of, significant haemorrhage. It studied the effects of antifibrinolytic treatment on death and transfusion requirement. The study is described at the original trial website. A public version of the data set is available from a repository of public data sets hosted by Vanderbilt University’s Department of Biostatistics (Prof. Frank Harrell Jr.).

The data set includes 20,207 patients and 44 variables.

Note: In contrast to the data analysed by Perel et al, the public version of the data set lacks the variables describing the economic region and the treatment allocation. In addition, the data set contains 20,207 patients, whereas the research paper mentions 20,127 patients as having been included in the study.

Crash2 dataset contents

Source dataset

We refer to the data set available online here as the source data set.

The contents of the source data set are displayed below. The data set is stored in the data-raw folder of the project directory.


Data frame: crash2

20207 observations and 44 variables, maximum # NAs: 17121

Name | Labels | Units | Levels | Class | Storage | NAs
entryid | Unique Numbers for Entry Forms |  |  | integer | integer | 0
source | Method of Transmission of Entry Form to CC |  | 5 |  | integer | 0
trandomised | Date of Randomization |  |  | Date | double | 0
outcomeid | Unique Number From Outcome Database |  |  | integer | integer | 80
sex |  |  | 2 |  | integer | 1
age |  |  |  |  | integer | 4
injurytime | Hours Since Injury |  |  | numeric | double | 11
injurytype |  |  | 3 |  | integer | 0
sbp | Systolic Blood Pressure | mmHg |  | integer | integer | 320
rr | Respiratory Rate | /min |  | integer | integer | 191
cc | Central Capillary Refille Time | s |  | integer | integer | 611
hr | Heart Rate | /min |  | integer | integer | 137
gcseye | Glasgow Coma Score Eye Opening |  |  | integer | integer | 732
gcsmotor | Glasgow Coma Score Motor Response |  |  | integer | integer | 732
gcsverbal | Glasgow Coma Score Verbal Response |  |  | integer | integer | 735
gcs | Glasgow Coma Score Total |  |  | integer | integer | 23
ddeath | Date of Death |  |  | Date | double | 17121
cause | Main Cause of Death |  | 7 |  | integer | 17118
scauseother | Description of Other Cause of Death |  | 227 |  | integer | 0
status | Status of Patient at Outcome if Alive |  | 3 |  | integer | 3169
ddischarge | Date of discharge, transfer to other hospital or day 28 from randomization |  |  | Date | double | 3185
condition | Condition of Patient at Outcome if Alive |  | 5 |  | integer | 3251
ndaysicu | Number of Days Spent in ICU |  |  | numeric | double | 182
bheadinj | Significant Head Injury |  |  | integer | integer | 80
bneuro | Neurosurgery Done |  |  | integer | integer | 80
bchest | Chest Surgery Done |  |  | integer | integer | 80
babdomen | Abdominal Surgery Done |  |  | integer | integer | 80
bpelvis | Pelvis Surgery Done |  |  | integer | integer | 80
bpe | Pulmonary Embolism |  |  | integer | integer | 80
bdvt | Deep Vein Thrombosis |  |  | integer | integer | 80
bstroke | Stroke |  |  | integer | integer | 80
bbleed | Surgery for Bleeding |  |  | integer | integer | 80
bmi | Myocardial Infarction |  |  | integer | integer | 80
bgi | Gastrointestinal Bleeding |  |  | integer | integer | 80
bloading | Complete Loading Dose of Trial Drug Given |  |  | integer | integer | 80
bmaint | Complete Maintenance Dose of Trial Drug Given |  |  | integer | integer | 80
btransf | Blood Products Transfusion |  |  | integer | integer | 80
ncell | Number of Units of Red Call Products Transfused |  |  | numeric | double | 9963
nplasma | Number of Units of Fresh Frozen Plasma Transfused |  |  | integer | integer | 9964
nplatelets | Number of Units of Platelets Transfused |  |  | integer | integer | 9964
ncryo | Number of Units of Cryoprecipitate Transfused |  |  | integer | integer | 9964
bvii | Recombinant Factor VIIa Given |  |  | integer | integer | 374
boxid | Treatment Box Number |  |  | integer | integer | 0
packnum | Treatment Pack Number |  |  | integer | integer | 0

Variable | Levels
source | telephone
 | telephone entered manually
 | electronic CRF by email
 | paper CRF enteredd in electronic CRF
 | electronic CRF
sex | male
 | female
injurytype | blunt
 | penetrating
 | blunt and penetrating
cause | bleeding
 | head injury
 | myocardial infarction
 | stroke
 | pulmonary embolism
 | multi organ failure
 | other
scauseother | 
 | Acute Hypoxia
 | ACUTE LUNG INJURY
 | Acute Pulmonary Oedema
 | Acute Renal Failure
 | ACUTE RESPIRATORY DISTRESS SYNDROME (ARDS)
 | acute respiratory failure
 | acute respiratory failure+sepsis
 | air amboli (embolism)
 | Air embolism caused by penetrating lung trauma
 | ...
status | discharged
 | still in hospital
 | transferred to other hospital
condition | no symptoms
 | minor symptoms
 | some restriction in lifestyle but independent
 | dependent, but not requiring constant attention
 | fully dependent, requiring attention day and night
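
The kind of information shown in the contents listing above (number of levels, storage mode, number of NAs per variable) can be approximated in base R. The following is a minimal sketch on a made-up toy data frame; the object names `toy` and `overview` are hypothetical, and this is not the function that produced the listing in this report.

```r
# Toy stand-in for the trial data; values are made up for illustration.
toy <- data.frame(
  age        = c(34L, NA, 51L, 27L),
  sex        = factor(c("male", "female", NA, "male")),
  injurytime = c(1.5, 2, NA, 0.5)
)

# One row per variable: number of factor levels, storage mode, count of NAs.
overview <- data.frame(
  name    = names(toy),
  levels  = vapply(toy, nlevels, integer(1)),
  storage = vapply(toy, typeof, character(1)),
  nas     = vapply(toy, function(x) sum(is.na(x)), integer(1)),
  row.names = NULL
)
overview
```

Factors are stored as integers, which is why categorical variables such as sex appear with storage mode integer.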

Updated analysis dataset

Additional metadata are added to the original source data set for the following variables, and the modified data set is then written back to the data folder:

  • age - add label “Age” and unit “years”.
  • injury time - add unit “hours”.
  • total Glasgow coma score - add unit “points”.

At this stage we also select the variables of interest to take into the IDA phase by dropping the variables we do not check in IDA.

As a cross-check, we display the contents again to confirm that the additional metadata have been added, and then write the changes back to the data folder in the file “data/a_crash2.rds”.

Input object size: 1221480 bytes; 12 variables 20207 observations
New object size: 1223272 bytes; 12 variables 20207 observations
Input object size: 1546808 bytes; 14 variables 20207 observations
New object size: 1385720 bytes; 14 variables 20207 observations


Data frame: a_crash2

20207 observations and 14 variables, maximum # NAs: 17121

Name | Labels | Units | Levels | Class | Storage | NAs
entryid | Unique Numbers for Entry Forms |  |  | integer | integer | 0
trandomised | Date of Randomization |  |  | Date | double | 0
ddeath | Date of Death |  |  | Date | double | 17121
age | Age | years |  | integer | integer | 4
sex | Sex |  | 2 |  | integer | 1
sbp | Systolic Blood Pressure | mmHg |  | integer | integer | 320
hr | Heart Rate | /min |  | integer | integer | 137
rr | Respiratory Rate | /min |  | integer | integer | 191
gcs | Glasgow Coma Score Total | points |  | integer | integer | 23
cc | Central Capillary Refille Time | s |  | integer | integer | 611
injurytime | Hours Since Injury | hours |  | numeric | double | 11
injurytype | Injury type |  | 3 |  | integer | 0
time2death |  |  |  |  | integer | 17121
earlydeath | Death within 28 days from injury |  |  | integer | integer | 0

Variable | Levels
sex | male
 | female
injurytype | blunt
 | penetrating
 | blunt and penetrating
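
The metadata update and variable selection described in this section can be sketched in base R as follows. This is an illustrative sketch on toy data, not the code used for this report: the attribute names "label" and "units" are an assumed convention, the data frame `toy` and the selection `keep` are hypothetical, and a temporary file stands in for the project's data folder.

```r
# Toy stand-in for the source data; values are made up for illustration.
toy <- data.frame(
  entryid    = 1:3,
  age        = c(34L, 51L, 27L),
  injurytime = c(1.5, 2, 0.5),
  boxid      = c(101L, 102L, 103L)  # a variable not taken into IDA
)

# Attach metadata as plain attributes (assumed convention for this sketch).
attr(toy$age, "label") <- "Age"
attr(toy$age, "units") <- "years"
attr(toy$injurytime, "units") <- "hours"

# Keep only the variables that go into the IDA phase.
keep  <- c("entryid", "age", "injurytime")
a_toy <- toy[, keep]

# Cross-check that the metadata survived the selection.
attributes(a_toy$age)

# Write the analysis data set and confirm it can be read back unchanged.
path <- tempfile(fileext = ".rds")
saveRDS(a_toy, path)
identical(readRDS(path), a_toy)
```

Selecting whole columns keeps their attributes, so the added labels and units travel with the analysis data set.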

Section session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Hmisc_4.5-0     Formula_1.2-4   survival_3.1-12 lattice_0.20-41
##  [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.4     purrr_0.3.4    
##  [9] readr_1.4.0     tidyr_1.1.2     tibble_3.0.6    ggplot2_3.3.3  
## [13] tidyverse_1.3.0 here_1.0.1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6          lubridate_1.7.9.2   png_0.1-7          
##  [4] assertthat_0.2.1    rprojroot_2.0.2     digest_0.6.27      
##  [7] R6_2.5.0            cellranger_1.1.0    backports_1.2.1    
## [10] reprex_1.0.0        evaluate_0.14       httr_1.4.2         
## [13] pillar_1.4.7        rlang_0.4.10        readxl_1.3.1       
## [16] data.table_1.13.6   rstudioapi_0.13     rpart_4.1-15       
## [19] Matrix_1.2-18       checkmate_2.0.0     rmarkdown_2.6      
## [22] splines_4.0.2       foreign_0.8-80      htmlwidgets_1.5.3  
## [25] munsell_0.5.0       broom_0.7.4         compiler_4.0.2     
## [28] modelr_0.1.8        xfun_0.20           pkgconfig_2.0.3    
## [31] base64enc_0.1-3     htmltools_0.5.1.1   nnet_7.3-14        
## [34] tidyselect_1.1.0    htmlTable_2.1.0     gridExtra_2.3      
## [37] bookdown_0.21       crayon_1.4.1        dbplyr_2.1.0       
## [40] withr_2.4.1         grid_4.0.2          jsonlite_1.7.2     
## [43] gtable_0.3.0        lifecycle_0.2.0     DBI_1.1.1          
## [46] magrittr_2.0.1      scales_1.1.1        rmdformats_1.0.1   
## [49] cli_2.3.0           stringi_1.5.3       fs_1.5.0           
## [52] latticeExtra_0.6-29 xml2_1.3.2          ellipsis_0.3.1     
## [55] generics_0.1.0      vctrs_0.3.6         RColorBrewer_1.1-2 
## [58] tools_4.0.2         glue_1.4.2          hms_1.0.0          
## [61] jpeg_0.1-8.1        yaml_2.2.1          colorspace_2.0-0   
## [64] cluster_2.1.0       rvest_0.3.6         knitr_1.31         
## [67] haven_2.3.1

Statistical analysis plan

Since a key principle of IDA is not to touch the research questions, the research aim and statistical analysis plan need to be in place before IDA commences. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan.

Hypothetical research aim for IDA: Develop a multivariable model for early death (death within 28 days from injury) using nine independent variables of mixed type (continuous, categorical, semicontinuous) with the primary aim of prediction and a secondary aim of describing the association of each variable with the outcome.

The assumed analysis aim is in line with the prediction model presented by Perel et al, BMJ 2012, supplement available at.

Outcome variable

Early death, i.e. in-hospital death within 28 days from injury (binary variable)

Statistical methods

Logistic regression will be used to model early death by the following independent variables (measured at randomisation) deemed important to predict early death.

Demographic measurements:

  • Age (age, years)
  • Sex (sex, male or female)

Physiological measurements:

  • Systolic blood pressure (sbp, mmHg)
  • Heart rate (hr, 1/min)
  • Respiratory rate (rr, 1/min)
  • Glasgow coma score (gcs, points)
  • Central capillary refill time (cc, seconds)

Characteristics of injury measurements:

  • Time since injury (injurytime, hours)
  • Type of injury (injurytype, ‘blunt’, ‘penetrating’ or ‘blunt and penetrating’)

Restricted cubic splines with 3 degrees of freedom and knots set to default values will be used for continuous variables. As the final prediction model should be parsimonious enough to simplify its application, a backward elimination algorithm with a significance level of \(\alpha=0.05\) will be applied to remove effects that are not statistically significant. Finally, the nonlinear representation of each continuous variable will be tested against a linear representation at \(\alpha=0.05\). If the nonlinear effect adds no value, the model will be refitted with a linear effect for that variable.
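
To make the test of nonlinear against linear representation concrete, here is a minimal sketch on simulated data. It assumes `splines::ns()` (a natural cubic spline basis) as a stand-in for the restricted cubic splines named in the plan, and a likelihood ratio test for the comparison; the plan does not specify either choice, the simulated effect sizes are arbitrary, and backward elimination is not shown.

```r
library(splines)  # ships with R; ns() gives a natural (restricted) cubic spline basis

set.seed(42)
n   <- 2000
age <- round(runif(n, 16, 90))
sbp <- round(rnorm(n, 100, 25))
# Simulated outcome with a U-shaped sbp effect; purely illustrative.
lp  <- -2.5 + 0.02 * (age - 35) + 0.0008 * (sbp - 100)^2
earlydeath <- rbinom(n, 1, plogis(lp))
sim <- data.frame(earlydeath, age, sbp)

# Spline model: 3 degrees of freedom per continuous variable.
fit_spline <- glm(earlydeath ~ ns(age, df = 3) + ns(sbp, df = 3),
                  family = binomial, data = sim)

# Same model with sbp entered linearly (nested in the spline model).
fit_linear_sbp <- glm(earlydeath ~ ns(age, df = 3) + sbp,
                      family = binomial, data = sim)

# Likelihood ratio test of nonlinear vs. linear representation of sbp.
lrt <- anova(fit_linear_sbp, fit_spline, test = "Chisq")
p_nonlinear <- lrt[["Pr(>Chi)"]][2]
p_nonlinear < 0.05  # TRUE would mean keeping the spline for sbp
```

In the real analysis this comparison would be repeated for each continuous variable that remains after backward elimination.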

Remarks

  • Regarding type of injury, the original paper describes its treatment in the model as follows: ‘Type of injury had three categories (penetrating, blunt, or blunt and penetrating) but we analysed it as “penetrating” or “blunt and penetrating.”’ It is not clear from that description what happened to the ‘blunt’ group; presumably it was collapsed with ‘blunt and penetrating’. Here we consider all three categories and revisit this choice in the recommendations for the final analysis.

  • The original paper describes the modeling approach as follows: ‘We used a backward step-wise approach. Firstly, we included all potential prognostic factors and interaction terms that users considered plausible. These interactions included all potential predictors with type of injury, time since injury, and age. We then removed, one at a time, terms for which we found no strong evidence of an association, judged according to the P values (<0.05) from the Wald test.’ This would mean they tested at least 24 interaction terms, each possibly using several degrees of freedom! In the final model, only an interaction of Glasgow coma score and type of injury was included.

Preparations

The outcome variable, early death (i.e., death within 28 days from injury), must be computed from the time span between the date of randomization and the date of death using the following logic:

  • convert ddeath and trandomised to a date format and compute their difference in days
  • interpret missing values (NAs) as ‘did not die within the study period, or at least not within 28 days’
  • treat patients who died after more than 28 days as alive

This can be derived using the following code logic:

## NOTE: This is for demonstration purposes; this code is not run here.
## The derivation was executed earlier.

# Days from randomisation to death (NA if no date of death is recorded)
a_crash2$time2death <-
  as.numeric(as.Date(a_crash2$ddeath) - as.Date(a_crash2$trandomised))

# Died within 28 days; "+ 0" transforms TRUE/FALSE to 1/0
a_crash2$earlydeath[!is.na(a_crash2$time2death)] <-
  (a_crash2$time2death[!is.na(a_crash2$time2death)] <= 28) + 0

# NA in time2death means alive at day 28
a_crash2$earlydeath[is.na(a_crash2$time2death)] <- 0
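
The same logic can be checked on a few made-up patients. The sketch below is self-contained; the dates and the data frame `toy` are invented for illustration, and the two assignment steps are condensed into a single expression.

```r
# Four hypothetical patients, all randomised on the same day.
toy <- data.frame(
  trandomised = as.Date(c("2007-01-01", "2007-01-01", "2007-01-01", "2007-01-01")),
  ddeath      = as.Date(c("2007-01-06", "2007-01-29", "2007-02-10", NA))
)

toy$time2death <- as.numeric(toy$ddeath - toy$trandomised)

# Died on or before day 28 -> 1; died later or no recorded death -> 0.
toy$earlydeath <- as.integer(!is.na(toy$time2death) & toy$time2death <= 28)

toy[, c("time2death", "earlydeath")]
```

The second patient dies exactly 28 days after randomisation and therefore counts as an early death, which matches the `<= 28` comparison used above.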

We also display the marginal distribution of the derived outcome variable.

a_crash2 %>%
  dplyr::select(earlydeath) %>%
  gtsummary::tbl_summary()
Characteristic | N = 20,207¹
Death within 28 days from injury | 3,076 (15%)

¹ n (%)

The number of deaths computed in the data set coincides with the number reported in Perel et al, BMJ 2012.

Sources

Data obtained from http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets

To download the data set, click the link to data set

Data dictionary

The data dictionary can be found LINK

References

CRASH-2 Collaborators. Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): a randomised, placebo-controlled trial. Lancet 2010;376:23-32

Perel P, Prieto-Merino D, Shakur H, Clayton T, Lecky F, Bouamra O, Russell R, Faulkner M, Steyerberg EW, Roberts I. Predicting early death in patients with traumatic bleeding: development and validation of prognostic model. BMJ 2012; 345(aug15 1): e5166.

Missing data

Per variable missingness

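Number and percentage of missing values per variable, sorted by decreasing number of missing values.

Variable | Missing (count) | Missing (%)
cc | 611 | 3.02
sbp | 320 | 1.58
rr | 191 | 0.95
hr | 137 | 0.68
gcs | 23 | 0.11
injurytime | 11 | 0.05
age | 4 | 0.02
sex | 1 | 0.00
injurytype | 0 | 0.00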
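
A table of this form can be computed with a few lines of base R. The sketch below uses a made-up toy data frame; the object names are hypothetical and this is not necessarily how the table above was produced.

```r
# Toy data with missing values; values are made up for illustration.
toy <- data.frame(
  cc  = c(2, NA, 3, NA, 4),
  sbp = c(90, 100, NA, 110, 95),
  age = c(30, 41, 25, 60, 19)
)

n_missing <- colSums(is.na(toy))
miss_tab <- data.frame(
  variable    = names(n_missing),
  missing_n   = as.integer(n_missing),
  missing_pct = round(100 * as.numeric(n_missing) / nrow(toy), 2)
)

# Sort so the variables with the most missing values come first.
miss_tab <- miss_tab[order(-miss_tab$missing_n), ]
miss_tab
```

Applied to the analysis data set, the same calculation gives for example 611 / 20207 = 3.02% for cc.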

Missingness patterns over variables

(In)complete cases

This section presents the patients with at least one missing value. First, we list these patients in a filterable table.

Then we report the patterns of missingness for this set of patients.
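
Counting missingness patterns among the incomplete cases can be sketched in base R as follows. The toy data and the "+"-separated pattern labels are assumptions made for illustration; the report itself may use a dedicated package for this display.

```r
# Toy data with missing values; values are made up for illustration.
toy <- data.frame(
  cc  = c(2, NA, 3, NA, 4, NA),
  sbp = c(90, NA, NA, 110, 95, NA),
  age = c(30, 41, 25, 60, 19, 52)
)

# Rows with at least one missing value.
incomplete <- toy[!complete.cases(toy), ]

# Encode each row's pattern by the names of its missing variables,
# e.g. "cc+sbp" when both cc and sbp are missing.
pattern <- apply(is.na(incomplete), 1,
                 function(m) paste(names(m)[m], collapse = "+"))

sort(table(pattern), decreasing = TRUE)
```

Patterns in which several variables are frequently missing together point to variables that would be lost jointly in a complete-case analysis.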

Section session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.17          kableExtra_1.3.1 gt_0.2.2         naniar_0.6.0    
##  [5] Hmisc_4.5-0      Formula_1.2-4    survival_3.1-12  lattice_0.20-41 
##  [9] forcats_0.5.1    stringr_1.4.0    dplyr_1.0.4      purrr_0.3.4     
## [13] readr_1.4.0      tidyr_1.1.2      tibble_3.0.6     ggplot2_3.3.3   
## [17] tidyverse_1.3.0  here_1.0.1      
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.0            lubridate_1.7.9.2   webshot_0.5.2      
##  [4] RColorBrewer_1.1-2  httr_1.4.2          rprojroot_2.0.2    
##  [7] UpSetR_1.4.0        tools_4.0.2         backports_1.2.1    
## [10] R6_2.5.0            rpart_4.1-15        DBI_1.1.1          
## [13] colorspace_2.0-0    nnet_7.3-14         withr_2.4.1        
## [16] tidyselect_1.1.0    gridExtra_2.3       compiler_4.0.2     
## [19] cli_2.3.0           rvest_0.3.6         htmlTable_2.1.0    
## [22] xml2_1.3.2          labeling_0.4.2      bookdown_0.21      
## [25] sass_0.3.1          scales_1.1.1        checkmate_2.0.0    
## [28] commonmark_1.7      digest_0.6.27       foreign_0.8-80     
## [31] rmarkdown_2.6       base64enc_0.1-3     jpeg_0.1-8.1       
## [34] pkgconfig_2.0.3     htmltools_0.5.1.1   dbplyr_2.1.0       
## [37] highr_0.8           htmlwidgets_1.5.3   rlang_0.4.10       
## [40] readxl_1.3.1        rstudioapi_0.13     generics_0.1.0     
## [43] farver_2.0.3        jsonlite_1.7.2      crosstalk_1.1.1    
## [46] magrittr_2.0.1      Matrix_1.2-18       Rcpp_1.0.6         
## [49] munsell_0.5.0       lifecycle_0.2.0     visdat_0.5.3       
## [52] stringi_1.5.3       yaml_2.2.1          plyr_1.8.6         
## [55] grid_4.0.2          crayon_1.4.1        haven_2.3.1        
## [58] splines_4.0.2       hms_1.0.0           knitr_1.31         
## [61] pillar_1.4.7        reprex_1.0.0        glue_1.4.2         
## [64] evaluate_0.14       latticeExtra_0.6-29 data.table_1.13.6  
## [67] modelr_0.1.8        png_0.1-7           vctrs_0.3.6        
## [70] rmdformats_1.0.1    cellranger_1.1.0    gtable_0.3.0       
## [73] assertthat_0.2.1    xfun_0.20           broom_0.7.4        
## [76] viridisLite_0.3.0   cluster_2.1.0       ellipsis_0.3.1

Univariate distribution checks

This section reports a series of univariate summary checks of the CRASH-2 dataset.

Data set overview

Using the Hmisc describe function, we provide an overview of the data set. The descriptive report also provides histograms of continuous variables. For ease of scanning the information, we group the report by measurement type.

Demographic variables

2 Variables   20207 Observations

age: Age [years]
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95
20203 | 4 | 84 | 0.999 | 34.56 | 15.55 | 18 | 19 | 24 | 30 | 43 | 55 | 64
lowest: 1 14 15 16 17, highest: 92 94 95 96 99

sex: Sex
n | missing | distinct
20206 | 1 | 2
 Value        male female
 Frequency   16935   3271
 Proportion  0.838  0.162

Physiological measurements

5 Variables   20207 Observations

sbp: Systolic Blood Pressure [mmHg]
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95
19887 | 320 | 173 | 0.989 | 98.45 | 27.86 | 60 | 70 | 80 | 95 | 110 | 130 | 143
lowest: 4 10 12 20 25, highest: 225 230 234 240 250

hr: Heart Rate [/min]
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95
20070 | 137 | 173 | 0.996 | 104.5 | 23.38 | 70 | 80 | 90 | 105 | 120 | 130 | 140
lowest: 3 4 5 6 10, highest: 190 192 198 200 220

rr: Respiratory Rate [/min]
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95
20016 | 191 | 68 | 0.99 | 23.06 | 7.052 | 14 | 16 | 20 | 22 | 26 | 30 | 35
lowest: 1 2 3 4 5, highest: 90 91 94 95 96

gcs: Glasgow Coma Score Total [points]
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95
20184 | 23 | 13 | 0.863 | 12.47 | 3.594 | 4 | 6 | 11 | 15 | 15 | 15 | 15
lowest: 3 4 5 6 7, highest: 11 12 13 14 15
 Value          3     4     5     6     7     8     9    10    11    12    13    14
 Frequency    784   520   441   584   733   576   504   663   586   951  1356  2140
 Proportion 0.039 0.026 0.022 0.029 0.036 0.029 0.025 0.033 0.029 0.047 0.067 0.106

 Value         15
 Frequency  10346
 Proportion 0.513

cc: Central Capillary Refille Time [s]
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95
19596 | 611 | 20 | 0.945 | 3.267 | 1.67 | 1 | 2 | 2 | 3 | 4 | 5 | 6
lowest: 1 2 3 4 5, highest: 17 18 20 30 60
 Value          1     2     3     4     5     6     7     8     9    10    11    12
 Frequency   1510  5328  6020  3367  1805   802   268   271    45   139     3     7
 Proportion 0.077 0.272 0.307 0.172 0.092 0.041 0.014 0.014 0.002 0.007 0.000 0.000

 Value         13    15    16    17    18    20    30    60
 Frequency      3    19     3     1     1     2     1     1
 Proportion 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000

Characteristics of injury

2 Variables   20207 Observations

injurytime: Hours Since Injury [hours]
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95
20196 | 11 | 93 | 0.972 | 2.844 | 2.35 | 0.5 | 1.0 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0
lowest: 0.10 0.15 0.20 0.25 0.30, highest: 22.00 45.00 48.00 72.00 96.00

injurytype: Injury type
n | missing | distinct
20207 | 0 | 3
 Value                      blunt           penetrating blunt and penetrating
 Frequency                  11189                  6552                  2466
 Proportion                 0.554                 0.324                 0.122

Categorical variables

We now provide a closer visual examination of the categorical predictors.

Categorical ordinal plots

The Glasgow coma score, an ordinal categorical variable, is also displayed separately.

Continuous variables

We now provide a closer visual examination of the continuous predictors.

There is evidence of digit preference, which should be explored further with targeted summaries. More detailed univariate summaries for the variables of interest are provided below.
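
One targeted summary for digit preference is the distribution of the terminal digit of the recorded values. The sketch below uses simulated blood pressures in which about half of the values are rounded to the nearest 10; the rounding mechanism and all numbers are assumptions made for illustration, not properties of the CRASH-2 data.

```r
set.seed(7)
# Simulated blood pressures; about half are rounded to the nearest 10
# to mimic digit preference.
true_sbp <- rnorm(1000, mean = 100, sd = 25)
rounded  <- runif(1000) < 0.5
sbp <- ifelse(rounded, round(true_sbp / 10) * 10, round(true_sbp))

# Share of each terminal digit; without digit preference every digit
# would be expected to have a share of roughly 0.10.
last_digit  <- abs(sbp) %% 10
digit_share <- prop.table(table(factor(last_digit, levels = 0:9)))
round(digit_share, 2)
```

A clear excess of terminal digits 0 or 5 in such a table would be consistent with rounding at data collection.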

Age

Distribution of subject age [years]

Five patients are under the age of 17, which the inclusion criteria for the study specify as the lower limit; one patient is recorded as aged 1.

Blood pressure

Distribution of SBP

Respiratory rate

Distribution of respiratory rate

Heart rate

Distribution of heart rate

Central capillary refill time

Distribution of Central capillary refill time

Hours since injury

Distribution of hours since injury

Section session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Hmisc_4.5-0     Formula_1.2-4   survival_3.1-12 lattice_0.20-41
##  [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.4     purrr_0.3.4    
##  [9] readr_1.4.0     tidyr_1.1.2     tibble_3.0.6    ggplot2_3.3.3  
## [13] tidyverse_1.3.0 here_1.0.1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6          lubridate_1.7.9.2   png_0.1-7          
##  [4] assertthat_0.2.1    rprojroot_2.0.2     digest_0.6.27      
##  [7] R6_2.5.0            cellranger_1.1.0    backports_1.2.1    
## [10] reprex_1.0.0        evaluate_0.14       highr_0.8          
## [13] httr_1.4.2          pillar_1.4.7        rlang_0.4.10       
## [16] readxl_1.3.1        data.table_1.13.6   rstudioapi_0.13    
## [19] rpart_4.1-15        Matrix_1.2-18       checkmate_2.0.0    
## [22] rmarkdown_2.6       labeling_0.4.2      splines_4.0.2      
## [25] foreign_0.8-80      htmlwidgets_1.5.3   munsell_0.5.0      
## [28] broom_0.7.4         compiler_4.0.2      modelr_0.1.8       
## [31] xfun_0.20           pkgconfig_2.0.3     base64enc_0.1-3    
## [34] htmltools_0.5.1.1   nnet_7.3-14         tidyselect_1.1.0   
## [37] htmlTable_2.1.0     gridExtra_2.3       bookdown_0.21      
## [40] crayon_1.4.1        dbplyr_2.1.0        withr_2.4.1        
## [43] grid_4.0.2          jsonlite_1.7.2      gtable_0.3.0       
## [46] lifecycle_0.2.0     DBI_1.1.1           magrittr_2.0.1     
## [49] scales_1.1.1        rmdformats_1.0.1    cli_2.3.0          
## [52] stringi_1.5.3       farver_2.0.3        fs_1.5.0           
## [55] latticeExtra_0.6-29 xml2_1.3.2          ellipsis_0.3.1     
## [58] generics_0.1.0      vctrs_0.3.6         RColorBrewer_1.1-2 
## [61] tools_4.0.2         glue_1.4.2          hms_1.0.0          
## [64] jpeg_0.1-8.1        yaml_2.2.1          colorspace_2.0-0   
## [67] cluster_2.1.0       rvest_0.3.6         knitr_1.31         
## [70] haven_2.3.1         patchwork_1.1.1

Multivariate distributions

Overview

Variable correlation

library(dplyr)     # provides %>%, select(), filter(), mutate(), across()
library(corrplot)  # provides corrplot()

# Complete cases only; the factors sex and injurytype enter as their integer codes
corrs <- a_crash2 %>%
  dplyr::select(age, sex, sbp, hr, rr, cc, injurytime, injurytype) %>%
  dplyr::filter(complete.cases(.)) %>%
  dplyr::mutate(dplyr::across(dplyr::everything(), as.numeric))

M <- cor(corrs)  # Pearson correlation matrix
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method = "color", col = col(200),
         type = "upper", order = "hclust", number.cex = .7,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         # hide correlation coefficient on the principal diagonal
         diag = FALSE)

Variable clustering

Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.

Hmisc::varclus( ~ age +  sbp +  hr + rr + cc + gcs + injurytime + injurytype + sex, data = a_crash2)
## Hmisc::varclus(x = ~age + sbp + hr + rr + cc + gcs + injurytime + 
##     injurytype + sex, data = a_crash2)
## 
## 
## Similarity matrix (Spearman rho^2)
## 
##                                  age  sbp   hr   rr   cc  gcs injurytime
## age                             1.00 0.00 0.00 0.00 0.00 0.00       0.01
## sbp                             0.00 1.00 0.11 0.03 0.07 0.01       0.01
## hr                              0.00 0.11 1.00 0.05 0.02 0.02       0.00
## rr                              0.00 0.03 0.05 1.00 0.02 0.00       0.00
## cc                              0.00 0.07 0.02 0.02 1.00 0.02       0.00
## gcs                             0.00 0.01 0.02 0.00 0.02 1.00       0.01
## injurytime                      0.01 0.01 0.00 0.00 0.00 0.01       1.00
## injurytypepenetrating           0.02 0.00 0.01 0.00 0.00 0.06       0.05
## injurytypeblunt and penetrating 0.00 0.01 0.01 0.00 0.00 0.01       0.00
## sexfemale                       0.01 0.00 0.00 0.00 0.00 0.00       0.00
##                                 injurytypepenetrating
## age                                              0.02
## sbp                                              0.00
## hr                                               0.01
## rr                                               0.00
## cc                                               0.00
## gcs                                              0.06
## injurytime                                       0.05
## injurytypepenetrating                            1.00
## injurytypeblunt and penetrating                  0.07
## sexfemale                                        0.02
##                                 injurytypeblunt and penetrating sexfemale
## age                                                        0.00      0.01
## sbp                                                        0.01      0.00
## hr                                                         0.01      0.00
## rr                                                         0.00      0.00
## cc                                                         0.00      0.00
## gcs                                                        0.01      0.00
## injurytime                                                 0.00      0.00
## injurytypepenetrating                                      0.07      0.02
## injurytypeblunt and penetrating                            1.00      0.00
## sexfemale                                                  0.00      1.00
## 
## No. of observations used for each pair:
## 
##                                   age   sbp    hr    rr    cc   gcs injurytime
## age                             20203 19884 20066 20012 19593 20180      20193
## sbp                             19884 19887 19795 19750 19316 19883      19877
## hr                              20066 19795 20070 19943 19482 20066      20059
## rr                              20012 19750 19943 20016 19454 20014      20008
## cc                              19593 19316 19482 19454 19596 19595      19588
## gcs                             20180 19883 20066 20014 19595 20184      20173
## injurytime                      20193 19877 20059 20008 19588 20173      20196
## injurytypepenetrating           20203 19887 20070 20016 19596 20184      20196
## injurytypeblunt and penetrating 20203 19887 20070 20016 19596 20184      20196
## sexfemale                       20202 19886 20069 20015 19595 20183      20195
##                                 injurytypepenetrating
## age                                             20203
## sbp                                             19887
## hr                                              20070
## rr                                              20016
## cc                                              19596
## gcs                                             20184
## injurytime                                      20196
## injurytypepenetrating                           20207
## injurytypeblunt and penetrating                 20207
## sexfemale                                       20206
##                                 injurytypeblunt and penetrating sexfemale
## age                                                       20203     20202
## sbp                                                       19887     19886
## hr                                                        20070     20069
## rr                                                        20016     20015
## cc                                                        19596     19595
## gcs                                                       20184     20183
## injurytime                                                20196     20195
## injurytypepenetrating                                     20207     20206
## injurytypeblunt and penetrating                           20207     20206
## sexfemale                                                 20206     20206
## 
## hclust results (method=complete)
## 
## 
## Call:
## hclust(d = as.dist(1 - x), method = method)
## 
## Cluster method   : complete 
## Number of objects: 10
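
The printed output shows the two ingredients of the clustering: a similarity matrix of squared Spearman correlations and a call to `hclust()` on one minus that similarity. A minimal base R sketch of the same idea follows. It uses simulated data rather than `a_crash2`, the variable names are reused only for illustration, and it covers continuous variables only (the output above shows factors entering as indicator columns such as `injurytypepenetrating`).

```r
# Simulated stand-in for a few predictors; not the CRASH-2 data.
set.seed(42)
n <- 200
sbp <- rnorm(n, 100, 25)
d <- data.frame(
  age = rnorm(n, 35, 14),
  sbp = sbp,
  hr  = 150 - 0.4 * sbp + rnorm(n, 0, 15),  # built to correlate with sbp
  rr  = rnorm(n, 23, 6)
)

# Similarity: squared Spearman rank correlation on pairwise complete observations
sim <- cor(d, method = "spearman", use = "pairwise.complete.obs")^2

# Same call shape as in the printed output: cluster on 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")
plot(hc, ylab = "1 - Spearman rho^2")
```

Variables that join low in the dendrogram are the most similar; in the CRASH-2 output above all off-diagonal similarities are small (at most 0.11).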

Plot associations.

plot(Hmisc::varclus( ~ age +  sbp +  hr + rr + cc + gcs + injurytime + injurytype + sex, data = a_crash2))

Variable redundancy

Redundancy analysis of predictor variables.

Hmisc::redun( ~ hr + rr + age + sbp + injurytype + sex  , data = a_crash2)
## 
## Redundancy Analysis
## 
## Hmisc::redun(formula = ~hr + rr + age + sbp + injurytype + sex, 
##     data = a_crash2)
## 
## n: 19689     p: 6    nk: 3 
## 
## Number of NAs:    518 
## Frequencies of Missing Values Due to Each Variable
##         hr         rr        age        sbp injurytype        sex 
##        137        191          4        320          0          1 
## 
## 
## Transformation of target variables forced to be linear
## 
## R-squared cutoff: 0.9    Type: ordinary 
## 
## R^2 with which each variable can be predicted from all other variables:
## 
##         hr         rr        age        sbp injurytype        sex 
##      0.116      0.044      0.052      0.099      0.061      0.035 
## 
## No redundant variables
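
The quantity reported above is the R^2 with which each variable can be predicted from all other variables. The sketch below illustrates that idea with ordinary linear models on simulated data. It is a simplification under stated assumptions: `Hmisc::redun` can use flexible (spline) transformations and removes variables stepwise, which this sketch does not do, and the variables `x1`, `x2`, `x3` are hypothetical.

```r
# Simulated predictors; x3 is constructed to be nearly redundant.
set.seed(7)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
d <- data.frame(
  x1 = x1,
  x2 = x2,
  x3 = x1 + x2 + rnorm(n, sd = 0.1)  # close to a linear combination of x1 and x2
)

# R^2 for predicting each variable from all the others (linear terms only)
r2 <- sapply(names(d), function(v) {
  fit <- lm(reformulate(setdiff(names(d), v), response = v), data = d)
  summary(fit)$r.squared
})
round(r2, 3)

# Variables exceeding the cutoff shown in the output above (0.9)
names(r2)[r2 > 0.9]
```

In the CRASH-2 output the largest R^2 is 0.116 (for hr), far below the 0.9 cutoff, which is why no variable is flagged as redundant.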

Summary reports by sex

Overall

Baseline characteristics by sex.
| Variable | N | male (N=16935) | female (N=3271) |
|---|---|---|---|
| Age, years | 20203 | 23.0 30.0 41.0; 33.7 ± 13.6 | 25.0 35.0 50.0; 38.8 ± 16.8 |
| Systolic Blood Pressure, mmHg | 19887 | 80.0 95.0 110.0; 98.8 ± 25.5 | 80.0 90.0 110.0; 96.7 ± 25.7 |
| Heart Rate, /min | 20070 | 90.0 105.0 120.0; 104.3 ± 21.2 | 92.0 106.0 120.0; 105.2 ± 21.0 |
| Respiratory Rate, /min | 20016 | 20.00 22.00 26.00; 23.07 ± 6.77 | 20.00 22.00 26.00; 23.03 ± 6.58 |
| Central Capillary Refill Time, s | 19596 | 2.00 3.00 4.00; 3.27 ± 1.72 | 2.00 3.00 4.00; 3.23 ± 1.59 |
| Glasgow Coma Score Total, points | 20184 | 11.00 15.00 15.00; 12.44 ± 3.72 | 12.00 14.00 15.00; 12.62 ± 3.46 |
| Hours Since Injury, hours | 20196 | 1.00 2.00 4.00; 2.85 ± 2.39 | 1.00 2.00 4.00; 2.84 ± 2.67 |
| Injury type: blunt | 20207 | 0.53 (8962/16935) | 0.68 (2227/3271) |
| Injury type: penetrating | | 0.35 (5930/16935) | 0.19 (621/3271) |
| Injury type: blunt and penetrating | | 0.12 (2043/16935) | 0.13 (423/3271) |

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents mean ± 1 SD. N is the number of non-missing values.

Distribution of age by sex

Distribution of age by sex

Distribution of systolic blood pressure by sex

Distribution of systolic blood pressure by sex

Distribution of heart rate by sex

Distribution of heart rate by sex

Distribution of respiratory rate by sex

Distribution of respiratory rate by sex

Distribution of central capillary refill time by sex

Distribution of central capillary refill time by sex

Distribution of hours since injury by sex

Distribution of hours since injury by sex

Distribution of Glasgow coma score by sex

Distribution of Glasgow coma score (point scale) by sex

Distribution of injury type by sex

Distribution of injury type by sex

Summary reports by age

Age is categorized to explore the relationship between age and the other baseline variables. The categorization is for exploratory purposes only; it is not meant to influence the analysis strategy, for example by categorizing or dichotomizing age in the regression model.

| Characteristic | N = 20,207 [1] |
|---|---|
| age_C | |
| <30 | 9,070 (45%) |
| 30-44 | 6,477 (32%) |
| 45-59 | 3,204 (16%) |
| 60+ | 1,452 (7.2%) |
| NA | 4 (<0.1%) |

[1] n (%)
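
The age categories shown above could be derived with `cut()`. The following is a minimal sketch, not the code used for the report: the example ages are hypothetical, and the cut points (30, 45, 60, with left-closed intervals) are an assumption inferred from the category labels.

```r
# Hypothetical ages; a_crash2 itself is not reproduced here.
age <- c(16, 29, 30, 44, 45, 59, 60, 85, NA)

# Left-closed intervals [a, b): 29 falls in "<30", 30 falls in "30-44"
age_C <- cut(
  age,
  breaks = c(-Inf, 30, 45, 60, Inf),
  labels = c("<30", "30-44", "45-59", "60+"),
  right = FALSE
)

# Counts per category, keeping missing values visible
table(age_C, useNA = "ifany")
```

Setting `right = FALSE` matters here: with the default right-closed intervals an age of exactly 30 would be assigned to "<30".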

Report all variables by age category.

Baseline characteristics by age categories.
| Variable | N | <30 (N=9070) | 30-44 (N=6477) | 45-59 (N=3204) | 60+ (N=1452) |
|---|---|---|---|---|---|
| Sex: female | 20202 | 0.13 (1183/9070) | 0.15 (959/6476) | 0.21 (659/3204) | 0.32 (469/1452) |
| Systolic Blood Pressure, mmHg | 19884 | 80.0 96.0 110.0; 98.1 ± 23.8 | 80.0 90.0 110.0; 97.7 ± 25.3 | 80.0 94.0 112.0; 100.1 ± 28.4 | 80.0 90.0 110.0; 100.4 ± 30.2 |
| Heart Rate, /min | 20066 | 91.0 106.0 120.0; 105.3 ± 21.3 | 90.0 106.0 120.0; 104.7 ± 20.9 | 90.0 104.0 120.0; 103.3 ± 21.0 | 88.0 100.0 116.0; 101.0 ± 21.8 |
| Respiratory Rate, /min | 20012 | 20.00 22.00 26.00; 22.93 ± 6.74 | 20.00 22.00 26.00; 23.24 ± 6.68 | 20.00 22.00 26.00; 23.11 ± 6.80 | 20.00 22.00 26.00; 23.04 ± 6.89 |
| Central Capillary Refill Time, s | 19593 | 2.00 3.00 4.00; 3.20 ± 1.77 | 2.00 3.00 4.00; 3.27 ± 1.65 | 2.00 3.00 4.00; 3.34 ± 1.64 | 2.00 3.00 4.00; 3.48 ± 1.56 |
| Glasgow Coma Score Total, points | 20180 | 11.00 15.00 15.00; 12.64 ± 3.61 | 11.00 14.50 15.00; 12.39 ± 3.72 | 11.00 14.00 15.00; 12.38 ± 3.70 | 10.00 14.00 15.00; 12.00 ± 3.82 |
| Hours Since Injury, hours | 20193 | 1.00 2.00 4.00; 2.71 ± 2.18 | 1.00 2.00 4.00; 2.83 ± 2.28 | 1.00 2.50 4.50; 3.12 ± 3.17 | 1.00 3.00 4.50; 3.12 ± 2.68 |
| Injury type: blunt | 20203 | 0.50 (4544/9070) | 0.53 (3462/6477) | 0.65 (2081/3204) | 0.76 (1101/1452) |
| Injury type: penetrating | | 0.38 (3448/9070) | 0.33 (2155/6477) | 0.23 (748/3204) | 0.14 (199/1452) |
| Injury type: blunt and penetrating | | 0.12 (1078/9070) | 0.13 (860/6477) | 0.12 (375/3204) | 0.10 (152/1452) |

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents mean ± 1 SD. N is the number of non-missing values.

Distribution of systolic blood pressure by age categories

Distribution of systolic blood pressure by age categories

Distribution of heart rate by age categories

Distribution of heart rate by age categories

Distribution of respiratory rate by age categories

Distribution of respiratory rate by age categories

Distribution of central capillary refill time by age categories

Distribution of central capillary refill time by age categories

WIP: multivariate scatter plots

a_crash2 %>% dplyr::filter(!is.na(sbp)) %>% tally()
##       n
## 1 19887
a_crash2 %>% dplyr::filter(is.na(sbp)) %>% tally()
##     n
## 1 320
# nrow() returns plain integers for the caption; tally() would return one-column data frames
bigN <- a_crash2 %>% dplyr::filter(!is.na(sbp) & !is.na(age)) %>% nrow()
n_miss <- a_crash2 %>% dplyr::filter(is.na(sbp) | is.na(age)) %>% nrow()

title <-
  paste0("Plot of ", Hmisc::label(a_crash2$age), " and ", Hmisc::label(a_crash2$sbp))

caption <-
  paste0(
    "n = ",
    bigN,
    " subjects displayed.\n",
    n_miss,
    " subjects with a missing value in at least one of the variables."
  )


x_axis <- paste0(Hmisc::label(a_crash2$age), " [", Hmisc::units(a_crash2$age), "]")
y_axis <- paste0(Hmisc::label(a_crash2$sbp), " [", Hmisc::units(a_crash2$sbp), "]")


p1 <- a_crash2 %>%
  dplyr::filter(!is.na(sbp) & !is.na(age)) %>%
  mutate(sbp = as.numeric(sbp), 
         age = as.numeric(age)) %>%
  # age on the x-axis and sbp on the y-axis, matching x_axis and y_axis
  ggplot(aes(x = age, y = sbp)) +
  xlab(x_axis) +
  ylab(y_axis) +
  labs(
    title = title,
    caption = caption
  ) +
  geom_point(shape = 16, #size = 0.5,
             alpha = 0.5,
             color = "firebrick2") +
  geom_rug() +
  theme_minimal()

p1

WIP: Scatter plots with a third or fourth variable

Scatter plot of age and RR by sex and injury type.

Scatter plot of SBP and RR by sex and injury type.

Summary reports by Glasgow coma score

Baseline characteristics by Glasgow coma score.
Continuous variables (overall N in the column header):

| GCS | Group N | Age, years (N=20203) | Heart Rate, /min (N=20070) | Respiratory Rate, /min (N=20016) |
|---|---|---|---|---|
| 3 | 784 | 24.0 32.0 44.0; 35.5 ± 14.9 | 90.0 112.0 128.0; 106.9 ± 31.3 | 12.00 20.00 28.00; 20.67 ± 10.74 |
| 4 | 520 | 25.0 33.0 44.0; 35.5 ± 14.1 | 95.0 114.0 130.0; 110.8 ± 29.2 | 16.00 22.00 28.00; 22.22 ± 9.14 |
| 5 | 441 | 24.0 32.0 45.0; 35.4 ± 14.7 | 98.0 110.5 130.0; 111.4 ± 25.4 | 18.00 22.00 28.00; 22.89 ± 8.69 |
| 6 | 584 | 23.0 31.0 45.0; 35.4 ± 15.4 | 90.0 110.0 123.2; 106.2 ± 24.4 | 18.00 21.00 26.00; 22.12 ± 7.56 |
| 7 | 733 | 23.0 30.0 42.0; 33.9 ± 14.0 | 95.0 109.0 120.0; 107.1 ± 23.0 | 18.00 20.00 26.00; 21.97 ± 7.69 |
| 8 | 576 | 24.0 32.0 45.0; 35.7 ± 15.0 | 95.0 110.0 120.0; 107.4 ± 24.0 | 18.00 22.00 28.00; 23.11 ± 7.73 |
| 9 | 504 | 24.0 32.0 44.0; 35.5 ± 14.6 | 92.0 109.0 120.0; 105.5 ± 21.9 | 20.00 24.00 28.00; 23.23 ± 6.99 |
| 10 | 663 | 24.0 31.0 42.0; 34.4 ± 13.8 | 96.0 110.0 124.8; 108.4 ± 24.0 | 19.00 22.00 28.00; 23.05 ± 6.73 |
| 11 | 586 | 24.0 33.0 46.0; 36.6 ± 15.6 | 96.0 110.0 122.0; 107.8 ± 20.4 | 20.00 23.00 28.00; 23.45 ± 6.37 |
| 12 | 951 | 25.0 32.0 45.0; 35.9 ± 14.3 | 100.0 110.0 122.0; 109.3 ± 20.2 | 20.00 24.00 28.00; 24.32 ± 6.41 |
| 13 | 1356 | 25.0 33.0 45.0; 36.4 ± 15.0 | 96.0 108.0 120.0; 106.5 ± 20.1 | 20.00 22.00 27.00; 23.45 ± 6.53 |
| 14 | 2140 | 24.0 31.0 44.0; 35.1 ± 14.7 | 92.0 105.0 120.0; 104.5 ± 19.9 | 20.00 22.00 26.00; 23.41 ± 6.09 |
| 15 | 10346 | 23.0 30.0 41.0; 33.7 ± 13.8 | 90.0 100.0 115.0; 102.0 ± 18.9 | 20.00 22.00 26.00; 23.14 ± 6.07 |

| GCS | Group N | Systolic Blood Pressure, mmHg (N=19887) | Central Capillary Refill Time, s (N=19596) | Hours Since Injury, hours (N=20196) |
|---|---|---|---|---|
| 3 | 784 | 70.0 85.0 103.0; 88.7 ± 33.7 | 3.00 4.00 5.00; 4.15 ± 2.13 | 1.00 2.00 4.00; 2.54 ± 1.94 |
| 4 | 520 | 78.0 90.0 116.0; 96.5 ± 31.2 | 3.00 4.00 5.00; 3.84 ± 1.90 | 1.00 3.00 5.00; 3.26 ± 2.20 |
| 5 | 441 | 80.0 90.0 118.0; 99.0 ± 30.7 | 2.00 3.00 5.00; 3.76 ± 1.91 | 1.00 3.00 5.00; 3.42 ± 2.19 |
| 6 | 584 | 80.0 100.0 127.0; 104.3 ± 32.1 | 2.00 3.00 4.00; 3.49 ± 1.64 | 2.00 3.75 6.00; 3.75 ± 2.31 |
| 7 | 733 | 80.0 100.0 130.0; 105.4 ± 30.6 | 2.00 3.00 4.00; 3.28 ± 1.55 | 2.00 3.00 5.00; 3.62 ± 2.20 |
| 8 | 576 | 80.0 90.0 115.0; 99.2 ± 29.4 | 2.00 3.00 4.00; 3.52 ± 1.69 | 1.00 3.00 5.00; 3.30 ± 2.17 |
| 9 | 504 | 80.0 96.0 120.0; 99.6 ± 28.9 | 2.00 3.00 4.00; 3.40 ± 3.00 | 1.00 3.00 5.00; 3.12 ± 2.20 |
| 10 | 663 | 80.0 90.0 110.0; 92.6 ± 28.0 | 2.00 3.00 4.00; 3.37 ± 1.66 | 1.00 2.00 4.00; 3.03 ± 2.19 |
| 11 | 586 | 80.0 90.0 110.0; 94.4 ± 26.4 | 2.00 3.00 4.00; 3.27 ± 1.51 | 1.00 2.50 4.00; 3.01 ± 2.05 |
| 12 | 951 | 71.0 90.0 100.0; 88.4 ± 24.7 | 3.00 3.00 4.00; 3.53 ± 1.60 | 1.00 2.00 4.00; 2.75 ± 2.03 |
| 13 | 1356 | 80.0 90.0 110.0; 95.9 ± 23.5 | 2.00 3.00 4.00; 3.40 ± 1.69 | 1.00 2.00 4.00; 2.79 ± 1.97 |
| 14 | 2140 | 80.0 90.0 110.0; 96.4 ± 22.8 | 2.00 3.00 4.00; 3.31 ± 1.73 | 1.00 2.00 4.00; 2.64 ± 1.99 |
| 15 | 10346 | 90.0 100.0 110.0; 100.5 ± 23.1 | 2.00 3.00 4.00; 3.06 ± 1.54 | 1.00 2.00 4.00; 2.71 ± 2.69 |

Categorical variables (overall N in the column header):

| GCS | Group N | Sex: female (N=20206) | Injury type: blunt (N=20207) | Injury type: penetrating | Injury type: blunt and penetrating |
|---|---|---|---|---|---|
| 3 | 784 | 0.14 (107/784) | 0.62 (483/784) | 0.22 (175/784) | 0.16 (126/784) |
| 4 | 520 | 0.13 (68/520) | 0.71 (371/520) | 0.10 (53/520) | 0.18 (96/520) |
| 5 | 441 | 0.12 (53/441) | 0.73 (324/441) | 0.09 (41/441) | 0.17 (76/441) |
| 6 | 584 | 0.16 (92/584) | 0.76 (443/584) | 0.10 (59/584) | 0.14 (82/584) |
| 7 | 733 | 0.14 (100/733) | 0.76 (559/733) | 0.11 (77/733) | 0.13 (97/733) |
| 8 | 576 | 0.15 (89/576) | 0.69 (399/576) | 0.15 (89/576) | 0.15 (88/576) |
| 9 | 504 | 0.15 (74/504) | 0.67 (338/504) | 0.17 (88/504) | 0.15 (78/504) |
| 10 | 663 | 0.19 (124/663) | 0.61 (407/663) | 0.23 (151/663) | 0.16 (105/663) |
| 11 | 586 | 0.17 (97/586) | 0.64 (377/586) | 0.21 (123/586) | 0.15 (86/586) |
| 12 | 951 | 0.21 (198/951) | 0.58 (550/951) | 0.29 (272/951) | 0.14 (129/951) |
| 13 | 1356 | 0.20 (270/1356) | 0.60 (814/1356) | 0.24 (326/1356) | 0.16 (216/1356) |
| 14 | 2140 | 0.18 (391/2139) | 0.58 (1237/2140) | 0.29 (629/2140) | 0.13 (274/2140) |
| 15 | 10346 | 0.16 (1604/10346) | 0.47 (4880/10346) | 0.43 (4458/10346) | 0.10 (1008/10346) |

a b c represent the lower quartile a, the median b, and the upper quartile c for continuous variables. x ± s represents mean ± 1 SD. N is the number of non-missing values.

Distribution of age by Glasgow coma score

Distribution of age by GCS

Distribution of systolic blood pressure by Glasgow coma score

Distribution of systolic blood pressure by GCS

Distribution of heart rate by Glasgow coma score

Distribution of heart rate by GCS

Distribution of respiratory rate by Glasgow coma score

Distribution of respiratory rate by GCS

Distribution of central capillary refill time by Glasgow coma score

Distribution of central capillary refill time by GCS

Section session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.1.1 corrplot_0.84   gtsummary_1.3.6 Hmisc_4.5-0    
##  [5] Formula_1.2-4   survival_3.1-12 lattice_0.20-41 plotly_4.9.3   
##  [9] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.4     purrr_0.3.4    
## [13] readr_1.4.0     tidyr_1.1.2     tibble_3.0.6    ggplot2_3.3.3  
## [17] tidyverse_1.3.0 here_1.0.1     
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.0            usethis_2.0.1       lubridate_1.7.9.2  
##  [4] RColorBrewer_1.1-2  httr_1.4.2          rprojroot_2.0.2    
##  [7] tools_4.0.2         backports_1.2.1     R6_2.5.0           
## [10] rpart_4.1-15        DBI_1.1.1           lazyeval_0.2.2     
## [13] colorspace_2.0-0    nnet_7.3-14         withr_2.4.1        
## [16] tidyselect_1.1.0    gridExtra_2.3       compiler_4.0.2     
## [19] cli_2.3.0           rvest_0.3.6         gt_0.2.2           
## [22] htmlTable_2.1.0     xml2_1.3.2          sass_0.3.1         
## [25] labeling_0.4.2      bookdown_0.21       scales_1.1.1       
## [28] checkmate_2.0.0     commonmark_1.7      digest_0.6.27      
## [31] foreign_0.8-80      rmarkdown_2.6       base64enc_0.1-3    
## [34] jpeg_0.1-8.1        pkgconfig_2.0.3     htmltools_0.5.1.1  
## [37] dbplyr_2.1.0        highr_0.8           htmlwidgets_1.5.3  
## [40] rlang_0.4.10        readxl_1.3.1        rstudioapi_0.13    
## [43] generics_0.1.0      farver_2.0.3        jsonlite_1.7.2     
## [46] crosstalk_1.1.1     magrittr_2.0.1      Matrix_1.2-18      
## [49] Rcpp_1.0.6          munsell_0.5.0       lifecycle_0.2.0    
## [52] stringi_1.5.3       yaml_2.2.1          grid_4.0.2         
## [55] crayon_1.4.1        haven_2.3.1         splines_4.0.2      
## [58] hms_1.0.0           knitr_1.31          pillar_1.4.7       
## [61] reprex_1.0.0        glue_1.4.2          evaluate_0.14      
## [64] latticeExtra_0.6-29 data.table_1.13.6   broom.helpers_1.1.0
## [67] modelr_0.1.8        png_0.1-7           vctrs_0.3.6        
## [70] rmdformats_1.0.1    cellranger_1.1.0    gtable_0.3.0       
## [73] assertthat_0.2.1    xfun_0.20           broom_0.7.4        
## [76] viridisLite_0.3.0   cluster_2.1.0       ellipsis_0.3.1

NHANES

Introduction to NHANES

Since a key principle of IDA is not to touch the research questions, before IDA commences the research aim and statistical analysis plan need to be in place. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan, which is described in more detail in the section nhanes_SAP.Rmd.

The hypothetical research aim for IDA is to develop a multivariable model for MVPA (minutes of moderate/vigorous physical activity). The primary aim is variable selection to predict MVPA; the secondary aim is to study the role of systolic blood pressure in addition to the variables identified. MVPA can be analyzed on the untransformed scale, to examine factors distinguishing very active participants with large amounts of time spent on MVPA from others, or on the logarithmic scale, to distinguish participants according to percentage changes in MVPA, thus de-emphasizing extreme values.
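
The difference between the two scales can be made concrete with a small numerical illustration. The values below are hypothetical MVPA minutes, not NHANES data, and the natural logarithm is an assumption (the text does not state the base, nor how zero minutes would be handled).

```r
# Hypothetical MVPA values in minutes; not NHANES data.
mvpa_a <- c(low = 5,  high = 100)
mvpa_b <- c(low = 10, high = 200)   # both participants double their MVPA

# Untransformed scale: the change of the very active participant dominates
diff_raw <- mvpa_b - mvpa_a
diff_raw

# Logarithmic scale: both doublings give the same difference, log(2)
diff_log <- log(mvpa_b) - log(mvpa_a)
diff_log
```

On the untransformed scale the two changes are 5 and 100 minutes, while on the logarithmic scale they are identical, which is the sense in which the log scale reflects percentage changes and de-emphasizes extreme values.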

NHANES Description

The National Health and Nutrition Examination Survey (NHANES) is a program of studies designed to assess the health and nutritional status of adults and children in the United States. The survey examines a nationally representative sample of non-institutionalized US civilians using a multistage probability sampling design that considers geographical area and minority representation. Sample weights are generated to create nationally representative estimates for the US population and subgroups defined by age, sex, and race/ethnicity. Link to CDC NHANES website. NHANES collects data on various health and behavior indicators, including physical activity and self‐reported diagnosis of prevalent health conditions such as diabetes mellitus, coronary artery disease, stroke, and cancer.

Physical activity was measured with a waist‐worn uniaxial accelerometer (AM‐7164; ActiGraph) for up to 7 days. Participants were asked to wear the device while awake, except when swimming or bathing. Data were cleaned according to calibration specification, and nonwear time was defined as an interval of at least 60 consecutive minutes of zero activity intensity counts. Days with fewer than 10 hours of wear time were excluded, and participants with at least 1 valid day of accelerometer data were included in the analysis. Mean counts per minute were calculated by dividing the sum of activity counts for a valid day by the number of minutes of wear time in that day, across all valid days (Troiano 2008).

Moderate or vigorous intensity was based on count thresholds. Time spent in such activities was determined by summing the minutes in a day where the count met the criterion for that intensity (Troiano 2008).
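
The processing rules described above can be sketched for a single day of minute-level counts. This is an illustration under stated assumptions, not the NHANES or rnhanesdata processing code: the counts are made up, `is_nonwear()` is a hypothetical helper, non-wear is simplified to runs of strictly zero counts, and the MVPA threshold of 2020 counts per minute is taken from the outcome definition in the statistical analysis plan of this document.

```r
# Hypothetical minute-level counts for one day (1440 minutes); not NHANES data.
counts <- c(rep(0, 420),     # 7 h of zero counts (device not worn)
            rep(300, 600),   # 10 h of light activity
            rep(2500, 30),   # 30 min of high-intensity activity
            rep(0, 390))     # 6.5 h of zero counts

# Non-wear: runs of at least `min_run` consecutive zero-count minutes
is_nonwear <- function(x, min_run = 60) {
  runs <- rle(x == 0)
  rep(runs$values & runs$lengths >= min_run, runs$lengths)
}

wear <- !is_nonwear(counts)
wear_minutes <- sum(wear)

# A day is valid with at least 10 hours of wear time
valid_day <- wear_minutes >= 10 * 60

# MVPA minutes: wear minutes with counts above the threshold (assumed: > 2020)
mvpa_threshold <- 2020
mvpa_minutes <- sum(wear & counts > mvpa_threshold)

# Mean counts per minute of wear time for this day
mean_cpm <- sum(counts[wear]) / wear_minutes
```

For a participant with several valid days, the per-day quantities would then be combined across valid days as described in the text.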

The NHANES 2003–2004 and 2005–2006 have a total of 14,631 participants with accelerometry data. Participants aged 30 to 85 at the time they wore the accelerometer are included. Other inclusion criteria are in line with the choices for the prediction model of 5 year mortality presented by Smirnova et al, J Gerontol A Biol Sci Med Sci 2020. The preparation of the data was based on “Organizing and Analyzing the Activity Data in NHANES” Leroux et al, Statistics in Biosciences 2019. High quality processed activity data combined with mortality and demographic information can be downloaded and used in R with code from Andrew Leroux (https://andrew-leroux.github.io/rnhanesdata/articles).

Preparations

The R code from Andrew Leroux (https://andrew-leroux.github.io/rnhanesdata/articles) was modified to apply fewer exclusion criteria, as noted below.

  • Re-level comorbidities to assign refused/don’t know as not having the condition

  • Re-level education to have 3 levels and categorize don’t know/refused to be missing

  • Re-level alcohol consumption to include a level set to missing

  • Removed the “bad” days from Act_Analysis and Act_Flags

  • Systolic blood pressure is the mean of the non-missing of four blood pressure variables

  • Following Smirnova et al, participants were excluded who

  1. had fewer than 3 days of data with at least 10 hours of estimated wear time or were deemed by NHANES to have poor quality data; non-wear periods were identified as intervals with at least 60 consecutive minutes of zero activity counts and at most 2 minutes with counts between 0 and 100;
  2. had missing mortality information or died accidentally;
  3. were alive with less than 1 year of follow-up.

The NHANES dataset used in this project contains 6680 participants.

  • For the purposes of this IDA project, in contrast to Smirnova et al, we did not exclude participants who
  1. had missing body mass index (BMI) or education predictor variables;
  2. had missing systolic blood pressure, total cholesterol, or high-density lipoprotein (HDL) cholesterol measurements.

The final data set in Smirnova et al contained 2,978 participants.
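
One of the preparation steps listed above defines systolic blood pressure as the mean of the non-missing values of four blood pressure variables. A minimal sketch of that rule follows; the readings are hypothetical, the column names `bpxsy1` to `bpxsy4` follow the source dataset, and treating participants without any reading as missing is an assumption.

```r
# Hypothetical readings in mmHg; column names follow the source dataset.
bp <- data.frame(
  bpxsy1 = c(120, NA, 134, NA),
  bpxsy2 = c(118, 126, NA, NA),
  bpxsy3 = c(122, 124, NA, NA),
  bpxsy4 = c(NA_real_, NA_real_, NA_real_, NA_real_)
)

# Mean of the non-missing readings per participant
sys <- rowMeans(bp, na.rm = TRUE)

# rowMeans() returns NaN when all readings are missing; record that as NA
sys[is.nan(sys)] <- NA
sys
```

The last participant has no reading at all and therefore keeps a missing value, which corresponds to the missing values reported for `sys` in the dataset contents.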

Sources

Leroux A. Vignettes for downloading and working with NHANES 2003-2004 and 2005-2006 accelerometry data https://andrew-leroux.github.io/rnhanesdata/articles/

To download the analysis data set, click the link to data set —GITHUB

Data dictionary

The data dictionary can be found LINK —- GITHUB

References

Troiano RP, Berrigan D, Dodd KW, Mâsse LC, Tilert T, McDowell M. Physical activity in the United States measured by accelerometer. Med Sci Sports Exerc. 2008 Jan;40(1):181-8. doi: 10.1249/mss.0b013e31815a51b3. PMID: 18091006.

Leroux A, Di J, Smirnova E, Mcguffey E, Cao Q, Bayatmokhtari E, Tabacu L, Zipunnikov V, Urbanek JK, Crainiceanu C. Organizing and Analyzing the Activity Data in NHANES. Stat Biosci 11, 262–287 (2019). https://doi.org/10.1007/s12561-018-09229-9

Smirnova E, Leroux A, Tabacu L, Zipunnikov V, Crainiceanu C, Urbanek JK. The Predictive Performance of Objective Measures of Physical Activity Derived From Accelerometry Data for 5-Year All-Cause Mortality in Older Adults: National Health and Nutritional Examination Survey 2003–2006, The Journals of Gerontology: Series A, Volume 75, Issue 9, September 2020, Pages 1779–1785, https://doi.org/10.1093/gerona/glz193

NHANES dataset contents

Source dataset

We refer to the source data set as the dataset available online here

Display the source dataset contents. This dataset is in the data-raw folder of the project directory.


Data frame:nhanesdat

6680 observations and 58 variables, maximum # NAs:5529  
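| Name | Levels | Storage | NAs |
|---|---|---|---|
| seqn | | integer | 0 |
| paxcal | | integer | 0 |
| paxstat | | integer | 0 |
| weekday | | integer | 0 |
| sddsrvyr | | double | 0 |
| eligstat | | integer | 0 |
| mortstat | | integer | 9 |
| permth.exm | | integer | 9 |
| sdmvpsu | | double | 0 |
| sdmvstra | | double | 0 |
| wtint2yr | | double | 0 |
| wtmec2yr | | double | 0 |
| ridagemn | | double | 0 |
| ridageex | | double | 0 |
| ridageyr | | double | 0 |
| bmi | | double | 56 |
| bmi.cat | 4 | integer | 56 |
| race | 5 | integer | 0 |
| gender | 2 | integer | 0 |
| diabetes | 2 | integer | 0 |
| chf | 2 | integer | 0 |
| chd | 2 | integer | 0 |
| cancer | 2 | integer | 0 |
| stroke | 2 | integer | 0 |
| educationadult | 3 | integer | 7 |
| mobilityproblem | 2 | integer | 0 |
| drinkstatus | 4 | integer | 0 |
| drinksperweek | | double | 466 |
| smokecigs | 3 | integer | 4 |
| bpxsy1 | | double | 972 |
| bpxsy2 | | double | 1224 |
| bpxsy3 | | double | 1296 |
| bpxsy4 | | double | 5529 |
| lbxtc | | double | 270 |
| lbdhdd | | double | 270 |
| age | | double | 0 |
| sys | | double | 320 |
| tac | | double | 708 |
| tlac | | double | 708 |
| wt | | double | 708 |
| st | | double | 708 |
| mvpa | | double | 708 |
| about | | double | 708 |
| sbout | | double | 708 |
| satp | | double | 708 |
| astp | | double | 708 |
| tlac.1 | | double | 708 |
| tlac.2 | | double | 708 |
| tlac.3 | | double | 708 |
| tlac.4 | | double | 708 |
| tlac.5 | | double | 708 |
| tlac.6 | | double | 708 |
| tlac.7 | | double | 708 |
| tlac.8 | | double | 708 |
| tlac.9 | | double | 708 |
| tlac.10 | | double | 708 |
| tlac.11 | | double | 708 |
| tlac.12 | | double | 708 |

| Variable | Levels |
|---|---|
| bmi.cat | Normal; Underweight; Overweight; Obese |
| race | White; Mexican American; Other Hispanic; Black; Other |
| gender | Male; Female |
| diabetes, chf, chd, cancer, stroke | No; Yes |
| educationadult | Less than high school; High school; More than high school |
| mobilityproblem | No Difficulty; Any Difficulty |
| drinkstatus | Moderate Drinker; Non-Drinker; Heavy Drinker; Missing alcohol |
| smokecigs | Never; Former; Current |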

Updated analysis dataset

Additional metadata are added to the original source data set, and the modified data set is written back to the data folder. Metadata are added for the following variables:

  • seqn - add label “respondent sequence number”
  • gender - add label “gender”
  • age - add label “age” and unit “years”
  • educationadult - add label “education level”
  • permth.exm - add label “Person Months of Follow-up from MEC/Exam Date”
  • mortstat - add label “Final mortality status”
  • sys - add label “Systolic blood pressure” and unit “mmHg”
  • lbxtc - add label “Total cholesterol” and unit “mg/dL”
  • lbdhdd - add label “HDL cholesterol” and unit “mg/dL”
  • smokecigs - add label “smoking status”
  • drinkstatus - add label “alcohol consumption”
  • bmi - add label “body mass index” and unit “kg/m2”
  • diabetes - add label “diabetes”
  • chf - add label “congestive heart failure”
  • cancer - add label “cancer”
  • stroke - add label “stroke”
  • mobilityproblem - add label “difficulties with mobility”
  • tac - add label “total activity counts per day”
  • tlac - add label “total log activity count (log(1+activity))”
  • wt - add label “total accelerometer wear time” and unit “minutes”
  • mvpa - add label “Moderate or vigorous physical activity” and unit “minutes”
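
Labels and units such as those listed above can be attached with the Hmisc package, which is attached in this section's session info. The following is a minimal sketch rather than the project's actual code: the data frame `d` is a hypothetical three-row stand-in, and only three of the variables are shown.

```r
library(Hmisc)

# Hypothetical stand-in for the source data; values are made up.
d <- data.frame(
  age  = c(34, 51, 67),
  bmi  = c(22.1, 30.4, 27.8),
  mvpa = c(12.5, 3.0, 0.4)
)

# Attach a label and a unit to each variable
label(d$age)  <- "age"
units(d$age)  <- "years"
label(d$bmi)  <- "body mass index"
units(d$bmi)  <- "kg/m2"
label(d$mvpa) <- "Moderate or vigorous physical activity"
units(d$mvpa) <- "minutes"

# Cross-check: display the contents, including labels and units
contents(d)
```

The stored metadata can later be read back with `Hmisc::label()` and `Hmisc::units()`, for example to build axis titles.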

At this stage we select the variables of interest to carry into the IDA phase by dropping the variables we do not check in IDA.

As a cross check we display the contents again to ensure the additional data is added, and then write back the changes to the data folder in the file “data/a_nhanes.rda”.

Input object size: 1479216 bytes; 33 variables 6680 observations
New object size: 1416624 bytes; 33 variables 6680 observations


Data frame:a_nhanes

6680 observations and 33 variables, maximum # NAs:708  
| Name | Labels | Units | Levels | Class | Storage | NAs |
|---|---|---|---|---|---|---|
| seqn | respondent sequence number | | | integer | integer | 0 |
| age | age | years | | numeric | double | 0 |
| gender | gender | | 2 | | integer | 0 |
| permth.exm | Person Months of Follow-up from MEC/Exam Date | | | integer | integer | 9 |
| mortstat | Final mortality status | | | integer | integer | 9 |
| educationadult | education level | | 3 | | integer | 7 |
| smokecigs | smoking status | | 3 | | integer | 4 |
| drinkstatus | alcohol consumption | | 4 | | integer | 0 |
| bmi | body mass index | kg/m2 | | numeric | double | 56 |
| diabetes | diabetes | | 2 | | integer | 0 |
| chf | congestive heart failure | | 2 | | integer | 0 |
| cancer | cancer | | 2 | | integer | 0 |
| stroke | stroke | | 2 | | integer | 0 |
| sys | Systolic blood pressure | mg/dl | | integer | integer | 320 |
| lbxtc | Total cholesterol | mg/dL | | integer | integer | 270 |
| lbdhdd | HDL cholesterol | mg/dL | | integer | integer | 270 |
| mobilityproblem | difficulties with mobility | | 2 | | integer | 0 |
| tac | total activity counts per day | | | numeric | double | 708 |
| tlac | total log activity count (log(1+activity)) | | | numeric | double | 708 |
| mvpa | Moderate or vigorous physical activity | minutes | | numeric | double | 708 |
| wt | total accelerometer wear time | minutes | | numeric | double | 708 |
| tlac.1 | total log actvity count 12:00AM-2:00AM | | | numeric | double | 708 |
| tlac.2 | total log actvity count 2:00AM-4:00AM | | | numeric | double | 708 |
| tlac.3 | total log actvity count 4:00AM-6:00AM | | | numeric | double | 708 |
| tlac.4 | total log actvity count 6:00AM-8:00AM | | | numeric | double | 708 |
| tlac.5 | total log actvity count 8:00AM-10:00AM | | | numeric | double | 708 |
| tlac.6 | total log actvity count 10:00AM-12:00PM | | | numeric | double | 708 |
| tlac.7 | total log actvity count 12:00PM-2:00PM | | | numeric | double | 708 |
| tlac.8 | total log actvity count 2:00PM-4:00PM | | | numeric | double | 708 |
| tlac.9 | total log actvity count 4:00PM-6:00PM | | | numeric | double | 708 |
| tlac.10 | total log actvity count 6:00PM-8:00PM | | | numeric | double | 708 |
| tlac.11 | total log actvity count 8:00PM-10:00PM | | | numeric | double | 708 |
| tlac.12 | total log actvity count 10:00PM-12:00AM | | | numeric | double | 708 |

| Variable | Levels |
|---|---|
| gender | Male; Female |
| educationadult | Less than high school; High school; More than high school |
| smokecigs | Never; Former; Current |
| drinkstatus | Moderate Drinker; Non-Drinker; Heavy Drinker; Missing alcohol |
| diabetes, chf, cancer, stroke | No; Yes |
| mobilityproblem | No Difficulty; Any Difficulty |

Section session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Hmisc_4.5-0     Formula_1.2-4   survival_3.1-12 lattice_0.20-41
##  [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.4     purrr_0.3.4    
##  [9] readr_1.4.0     tidyr_1.1.2     tibble_3.0.6    ggplot2_3.3.3  
## [13] tidyverse_1.3.0 here_1.0.1     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6          lubridate_1.7.9.2   png_0.1-7          
##  [4] assertthat_0.2.1    rprojroot_2.0.2     digest_0.6.27      
##  [7] R6_2.5.0            cellranger_1.1.0    backports_1.2.1    
## [10] reprex_1.0.0        evaluate_0.14       httr_1.4.2         
## [13] pillar_1.4.7        rlang_0.4.10        readxl_1.3.1       
## [16] data.table_1.13.6   rstudioapi_0.13     rpart_4.1-15       
## [19] Matrix_1.2-18       checkmate_2.0.0     rmarkdown_2.6      
## [22] splines_4.0.2       foreign_0.8-80      htmlwidgets_1.5.3  
## [25] munsell_0.5.0       broom_0.7.4         compiler_4.0.2     
## [28] modelr_0.1.8        xfun_0.20           pkgconfig_2.0.3    
## [31] base64enc_0.1-3     htmltools_0.5.1.1   nnet_7.3-14        
## [34] tidyselect_1.1.0    htmlTable_2.1.0     gridExtra_2.3      
## [37] bookdown_0.21       crayon_1.4.1        dbplyr_2.1.0       
## [40] withr_2.4.1         grid_4.0.2          jsonlite_1.7.2     
## [43] gtable_0.3.0        lifecycle_0.2.0     DBI_1.1.1          
## [46] magrittr_2.0.1      scales_1.1.1        rmdformats_1.0.1   
## [49] cli_2.3.0           stringi_1.5.3       fs_1.5.0           
## [52] latticeExtra_0.6-29 xml2_1.3.2          ellipsis_0.3.1     
## [55] generics_0.1.0      vctrs_0.3.6         RColorBrewer_1.1-2 
## [58] tools_4.0.2         glue_1.4.2          hms_1.0.0          
## [61] jpeg_0.1-8.1        yaml_2.2.1          colorspace_2.0-0   
## [64] cluster_2.1.0       rvest_0.3.6         knitr_1.31         
## [67] haven_2.3.1

Statistical analysis plan

Since a key principle of IDA is not to touch the research questions, the research aim and statistical analysis plan need to be in place before IDA commences. IDA may lead to an update or refinement of the analysis plan. To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan.

Hypothetical research aim for IDA: The primary aim is to develop a multivariable model for MVPA (minutes of moderate/vigorous physical activity), using variable selection to identify predictors of MVPA. Specifically, the role of gender and age will be investigated. A secondary aim is to study the role of systolic blood pressure in addition to the variables identified. MVPA can be analysed on its original scale, to examine factors distinguishing very active participants with large amounts of time spent on MVPA from others, or on a logarithmic scale, to distinguish participants according to percentage changes in MVPA, thus de-emphasizing extreme values.

The inclusion criteria are in line with the choices for the prediction model of 5 year mortality presented by Smirnova et al, J Gerontol A Biol Sci Med Sci 2020.

Statistical methods

Linear regression models will be used to model MVPA. Explanatory variables are age, gender, education, smoking, alcohol consumption, BMI, comorbidities (cancer, CHF, stroke), cholesterol (total, HDL). Partial R-squared will be used to identify an appropriate set of variables to predict MVPA. A secondary aim is to study the role of systolic blood pressure on MVPA in a linear regression model with variables identified in the previous step.
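As a rough illustration of how partial R-squared could be obtained for each explanatory variable, the following R sketch compares a full linear model with the model omitting one term at a time. The data are simulated, only three of the planned variables are mimicked, and the helper partial_r2 is a hypothetical name introduced here; it is not the code used for this report.

```r
# Simulated stand-in data: names mimic the analysis plan, values are invented.
set.seed(1)
n <- 200
sim <- data.frame(
  age    = runif(n, 30, 85),
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  bmi    = rnorm(n, 28, 5)
)
sim$mvpa <- 60 - 0.6 * sim$age + 5 * (sim$gender == "Male") + rnorm(n, sd = 8)

# Partial R-squared of one term: relative drop in the residual sum of squares
# when the term is added to a model that already contains all other terms.
partial_r2 <- function(full_formula, term, data) {
  reduced_formula <- update(full_formula, as.formula(paste(". ~ . -", term)))
  sse_full    <- sum(residuals(lm(full_formula, data = data))^2)
  sse_reduced <- sum(residuals(lm(reduced_formula, data = data))^2)
  (sse_reduced - sse_full) / sse_reduced
}

model_terms <- c("age", "gender", "bmi")
pr2 <- sapply(model_terms, partial_r2,
              full_formula = mvpa ~ age + gender + bmi, data = sim)
print(round(pr2, 3))
```

Terms with a partial R-squared close to zero contribute little once the other variables are in the model, which is the kind of information the variable selection step would draw on.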

Variables

Outcome variable

MVPA (total minutes of moderate/vigorous physical activity which is defined as more than 2020 counts per minute) (mvpa, minutes)

Sociodemographic variables

  • age at examination (i.e. when participants wore the device) (age, years)
  • gender (gender, “Male” and “Female”)
  • race/ethnicity (non-Hispanic “White”, non-Hispanic “Black”, “Mexican American”, and “Other”)
  • education (“Less than high school”, “High school” (high school graduate/general educational development [GED]), “More than high school” (some college, and college graduate)) (educationadult)
  • 5 year mortality, NAs for individuals with follow up less than 5 years and alive (yr5.mort)
  • Person Months of Follow-up from MEC/Exam Date (permth.exm) (follow-up time in this cohort in years = permth.exm/12)
  • final mortality status (mortstat, 0, 1, NAs for individuals with follow up less than 5 years and alive)
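The derived follow-up variables in the list above can be sketched in R as follows. The four records are invented, and the exact coding rule for yr5.mort (in particular how deaths after more than 5 years are handled) is an assumption made for illustration; the data dictionary remains the authoritative source.

```r
# Hypothetical records; column names follow the variable list above.
fu <- data.frame(
  permth.exm = c(24, 135, 40, 80),  # months of follow-up from the exam date
  mortstat   = c(1, 0, 0, 1)        # 1 = deceased, 0 = assumed alive
)

# Follow-up time in years, as stated in the variable list.
fu$followup_years <- fu$permth.exm / 12

# Assumed coding: death within 5 years -> 1; followed for at least 5 years -> 0;
# alive with less than 5 years of follow-up -> NA (status at 5 years unknown).
fu$yr5.mort <- with(fu, ifelse(mortstat == 1 & followup_years <= 5, 1,
                        ifelse(followup_years >= 5, 0, NA)))
print(fu)
```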

Health and behavior variables

  • smoking status (Current, Former [those reporting quitting within the previous 6 months], and Never) (smokecigs)
  • alcohol consumption (drinkstatus) (Non-Drinker, Moderate Drinker, Heavy Drinker, Missing alcohol)
  • bmi (bmi, kg/m2)
  • obesity (bmi.cat, No-Yes)
  • diabetes (diabetes)
  • congestive heart failure (chf, No-Yes)
  • cancer (cancer, No-Yes)
  • stroke (stroke, No-Yes)
  • average systolic blood pressure using the 4 measurements per participant (sys, mmHg)
  • Total cholesterol (lbxtc, mg/dL)
  • HDL cholesterol (lbdhdd, mg/dL)
  • difficulties with mobility (mobilityproblem, “No Difficulty”, “Any Difficulty” = a positive response to difficulty walking a quarter-mile, difficulty climbing 10 stairs, or use of any special equipment to walk)

Physical activity data

Summary measures are calculated due to the large size of minute level accelerometer-derived physical activity data.

  • total activity counts per day (TAC/d)
  • total log activity count (TLAC log(1+TAC))
  • total minutes of moderate/vigorous physical activity (MVPA)
  • total accelerometer wear time (WT)
  • total log activity count summary measures (tlac.1, tlac.2, …, tlac.12) in each 2-hr window, i.e. 12AM-2AM, 2AM-4AM, 4AM-6AM, etc.
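To make the summary measures concrete, the sketch below computes them for a single simulated day of minute-level counts. The counts and wear flags are invented, the threshold of more than 2020 counts per minute is taken from the outcome definition above, and summing log(1 + count) over minutes is an assumed reading of the TLAC definition.

```r
# One simulated day of minute-level activity counts (1440 minutes); invented data.
set.seed(2)
counts <- rpois(1440, lambda = 150)
counts[1:360]   <- 0       # no recorded movement from 12AM to 6AM
counts[600:629] <- 2500    # a 30-minute bout above the MVPA threshold
wear <- rep(TRUE, 1440)    # hypothetical per-minute wear flags

tac  <- sum(counts[wear])              # total activity count
tlac <- sum(log(1 + counts[wear]))     # total log activity count (assumed definition)
mvpa <- sum(counts[wear] > 2020)       # minutes with more than 2020 counts
wt   <- sum(wear)                      # wear time in minutes

# Total log activity count in each 2-hour window (12 windows per day).
window <- rep(1:12, each = 120)
tlac_window <- as.vector(tapply(log(1 + counts) * wear, window, sum))
names(tlac_window) <- paste0("tlac.", 1:12)
print(round(tlac_window, 1))
```

The twelve window totals add up to the daily TLAC, which is one reason the windowed variables are strongly related to the overall summary measures.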

Initial data analysis strategy

1. Statistical analysis plan: as assumed above, the analysis strategy to answer the main research question has been prespecified. It comprises the set of independent variables to be considered in a model, the outcome variable, and the analytical strategy to build the regression model.

SAP is listed above

2. Data dictionary and metadata: a detailed data dictionary should be available which informs about the meaning of each variable in the context of the research question, the units of measurement, the possible levels in case of categorical variables, or admissible values. More generally, metadata also refer to information about the research study protocol and data collection processes.

A data dictionary is available.

3. Domain expertise and pivotal covariates (‘very important predictors’, VIPs)

It has been shown that physical activity declines with age and that men report higher levels of activity than women. Age and gender, also defined in the research aim, are pivotal covariates.

Keadle S et al. Prevalence and trends in physical activity among older adults in the United States: A comparison across three national surveys. Prev Med. 2016 Aug; 89: 37–43. https://doi.org/10.1016/j.ypmed.2016.05.009

Clarke TC, Norris T, and Schiller JS. Early Release of Selected Estimates Based on Data From the 2018 National Health Interview Survey. https://www.cdc.gov/nchs/nhis/releases/released201905.htm#7a

3.2. Domain expertise may also be useful to specify in advance which variables are expected to correlate with each other. This background knowledge could be summarized in a directed or undirected acyclic graph connecting the covariates with each other as also suggested by Heinze et al, 2018.

3.3. Missing value mechanisms: if not already specified in the metadata, domain experts should also be consulted to explain possible reasons for the occurrence of missing values for each variable, which may be categorized as systematic or unsystematic.

Missing values are expected due to the nature of survey research. Domain expertise would be helpful in identifying specific expectations. Missingness of some covariates may be associated with the outcome variable. This will be considered in the IDA domain ‘Missing values’ to identify approaches or updates to the SAP.

IDA domain: missing values

1. Number and proportion of missing values for each independent variable, for the dependent variable and for the analysis as a whole.

Number and proportion of missing values will be computed for all variables.
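A minimal base R sketch of this step is given below. The small data frame toy and the output column names are invented for illustration; in the report the same idea would be applied to the full analysis data set.

```r
# Invented toy data with a few missing values.
toy <- data.frame(
  mvpa = c(12, NA, 30, NA),
  sys  = c(120, 133, NA, 154),
  age  = c(56, 52, 63, 83)
)

# Number and percentage of missing values per variable, most affected first.
n_miss <- colSums(is.na(toy))
miss_tab <- data.frame(
  variable    = names(n_miss),
  missing_n   = as.vector(n_miss),
  missing_pct = round(100 * as.vector(n_miss) / nrow(toy), 2)
)
miss_tab <- miss_tab[order(-miss_tab$missing_n), ]

# Missingness for the analysis as a whole: proportion of complete rows.
complete_prop <- mean(complete.cases(toy))
print(miss_tab)
print(complete_prop)
```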

2. Patterns of missing values across all independent variables, either as tables or appropriately visualized.

We will create missing value indicators for each covariate and will then summarize patterns by means of a heat map and a dendrogram.
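One possible way to build such indicators and cluster them is sketched below in base R. The data are simulated so that two variables are always missing together; the binary (Jaccard-type) distance and average linkage are assumptions chosen for illustration, not a description of the method used in this report.

```r
# Simulated data in which a and b are missing for the same participants.
set.seed(3)
n <- 100
toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
drop_ab <- sample(n, 20)
toy$a[drop_ab] <- NA
toy$b[drop_ab] <- NA
toy$c[sample(n, 10)] <- NA

# Missing-value indicators (1 = missing), keeping variables with any missing value.
ind <- 1 * is.na(toy)
ind <- ind[, colSums(ind) > 0, drop = FALSE]

# Cluster variables by how similar their missingness patterns are.
d  <- dist(t(ind), method = "binary")
hc <- hclust(d, method = "average")

if (interactive()) {
  plot(hc, main = "Clustering of missingness indicators")  # dendrogram
  heatmap(ind, Rowv = NA, scale = "none")                  # heat map of indicators
}
```

Variables that merge at height zero in the dendrogram, such as a and b here, are missing for exactly the same participants.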

3. Patterns of missing values associated with the outcome variable

This may need to change.

From Lee et al (STRATOS)

  • A table of the observed characteristics for the “complete” versus “incomplete” (or all) participants, or by whether variables with substantial missingness are observed.

  • An assessment of the predictors of missingness, e.g. using a logistic regression model fitted to an indicator for being a complete record, and predictors of missing values i.e. associations with the incomplete variables.
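The second suggestion, a logistic regression fitted to an indicator for being a complete record, might look as follows in R. The data and the missingness mechanism are simulated purely for illustration, and only age and gender are used as predictors.

```r
# Simulated data in which the chance of a missing outcome depends on age (invented).
set.seed(4)
n <- 2000
toy <- data.frame(
  age    = runif(n, 30, 85),
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE))
)
p_miss <- plogis(1 - 0.06 * toy$age)
toy$mvpa <- ifelse(runif(n) < p_miss, NA, rexp(n, rate = 1 / 20))

# Indicator for a complete record (here: outcome observed).
toy$complete <- as.integer(!is.na(toy$mvpa))

# Logistic regression of the completeness indicator on candidate predictors.
fit <- glm(complete ~ age + gender, family = binomial, data = toy)
print(round(summary(fit)$coefficients, 3))
```

A coefficient clearly different from zero suggests that completeness depends on that variable, which would argue against treating the missing values as unsystematic.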

IDA domain: univariate distributions

3. For categorical variables (including the dependent variable): frequency and proportion for each category.

Demographic variables, smoking status, alcohol consumption, comorbidities, and mortality status will be described by frequencies and proportions.

4. For continuous variables (including the dependent variable): high-resolution histogram, summary of main percentiles (1st, 10th, 25th, 50th, 75th, 90th, 99th) and interquartile range, 5 highest and 5 lowest values, first four moments (mean, variance, skewness, kurtosis), standard deviation, number of distinct values.

Summaries for all continuous variables (age, BMI, physiological variables, physical activity) will be created to depict their marginal distributions by means of high-resolution histograms. Furthermore, each continuous variable will be described by 1st, 10th, 25th, 50th, 75th, 90th, 99th percentiles, interquartile range, the 5 highest and 5 lowest values, the first four moments (mean, variance, skewness, kurtosis), standard deviation, and the number of distinct values.
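The numerical part of this summary can be sketched in base R as below. The function name describe_cont is hypothetical, and skewness and kurtosis are computed as standardized third and fourth moments using the population standard deviation, which is one of several common conventions.

```r
# Univariate summary of a continuous variable (hypothetical helper).
describe_cont <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  z <- (x - m) / sqrt(mean((x - m)^2))   # standardized with the population SD
  list(
    percentiles = quantile(x, c(.01, .10, .25, .50, .75, .90, .99)),
    iqr         = IQR(x),
    lowest      = head(sort(x), 5),
    highest     = tail(sort(x), 5),
    mean        = m,
    variance    = var(x),
    sd          = sd(x),
    skewness    = mean(z^3),
    kurtosis    = mean(z^4),
    n_distinct  = length(unique(x))
  )
}

# Example on invented right-skewed data with one missing value.
set.seed(5)
out <- describe_cont(c(rexp(300, rate = 1 / 20), NA))
print(out$percentiles)

if (interactive()) hist(rexp(300, rate = 1 / 20), breaks = 100)  # high-resolution histogram
```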

Although the outcome variable MVPA is the only physical activity variable in the analysis plan, it is derived from some of the other physical activity variables. These will therefore also be examined in the univariate step to identify potential issues of skewness or unusual values.

The graphical summary for each variable will serve to suggest transformations for each variable:

  • no transformation (in case of approximate symmetry);
  • \(\log_{10}(x+1)\) transformation (in case of skewness)

The distributions of the transformed variables will also be evaluated as described above.
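As an illustration of the suggested transformation, the sketch below compares the skewness of a simulated right-skewed variable before and after applying log10(x + 1). The variable mvpa_sim is invented and the helper skew is a hypothetical name.

```r
# Moment-based skewness (hypothetical helper).
skew <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(6)
mvpa_sim <- rexp(1000, rate = 1 / 20)   # invented right-skewed values
mvpa_log <- log10(mvpa_sim + 1)         # transformation proposed for skewed variables

print(round(c(raw = skew(mvpa_sim), log10_plus_1 = skew(mvpa_log)), 2))

if (interactive()) {
  op <- par(mfrow = c(1, 2))
  hist(mvpa_sim, breaks = 100, main = "Untransformed")
  hist(mvpa_log, breaks = 100, main = "log10(x + 1)")
  par(op)
}
```

Adding 1 before taking logarithms keeps zero values, which occur for MVPA, in the data.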

It is assumed that the data have been cleaned, but unusual values will be identified and possibly excluded.

IDA domain: multivariate system of variables

5. Matrix/heatmap of Pearson correlation coefficients between all independent variables.

Pearson correlation coefficients will be computed between all independent variables. The correlation coefficients will be depicted by means of a (square) heat map. Moreover, a network graph between all independent variables will be constructed, in which only pairs with an absolute correlation coefficient of at least 0.3 are connected.

Spearman correlation coefficients will be computed as well, and the 10 pairs of covariates with the largest absolute difference between Pearson and Spearman correlation coefficients will be flagged. These pairs will be graphically investigated by constructing separate scatterplots.
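The comparison of Pearson and Spearman coefficients could be organized as in the following base R sketch. The data are simulated, the variable skewed is an invented example of a monotone but nonlinear relationship, and the object names are hypothetical.

```r
# Simulated covariates (invented values).
set.seed(7)
n <- 300
x <- data.frame(age = runif(n, 30, 85))
x$sys    <- 100 + 0.5 * x$age + rnorm(n, sd = 15)
x$bmi    <- rnorm(n, 28, 5)
x$lbxtc  <- exp(rnorm(n, log(200), 0.2))
x$skewed <- exp(x$age / 10)   # monotone but nonlinear in age

r_p <- cor(x, method = "pearson",  use = "pairwise.complete.obs")
r_s <- cor(x, method = "spearman", use = "pairwise.complete.obs")

# One row per unordered pair of variables.
idx <- which(upper.tri(r_p), arr.ind = TRUE)
pairs <- data.frame(
  var1     = rownames(r_p)[idx[, "row"]],
  var2     = colnames(r_p)[idx[, "col"]],
  pearson  = r_p[idx],
  spearman = r_s[idx]
)
pairs$abs_diff <- abs(pairs$pearson - pairs$spearman)

flagged <- head(pairs[order(-pairs$abs_diff), ], 10)  # 10 largest discrepancies
edges   <- pairs[abs(pairs$pearson) >= 0.3, ]         # pairs kept in the network graph
print(flagged)
```

Large discrepancies typically point to nonlinear relationships or to a few extreme observations, which is why the flagged pairs are inspected in scatterplots.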

6. Appropriate visual (and numerical) presentations of the association of each covariate with the two pivotal covariates.

A redundancy analysis will be conducted for each variable. This analysis identifies predictors that can be almost perfectly predicted by flexible parametric additive models performed on the companion covariates.
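The idea behind a redundancy analysis can be illustrated with a simplified sketch: each variable is regressed on all others and flagged if it is predicted almost perfectly. This stand-in uses ordinary linear models rather than the flexible additive models described above, the data are simulated, and the cut-off of 0.9 for R-squared is an assumption.

```r
# Simulated covariates; weight_proxy is constructed as a near copy of bmi.
set.seed(8)
n <- 300
x <- data.frame(bmi = rnorm(n, 28, 5), age = runif(n, 30, 85))
x$weight_proxy <- 2.9 * x$bmi + rnorm(n, sd = 0.5)
x$sys <- 100 + 0.5 * x$age + rnorm(n, sd = 15)

# R-squared from predicting each variable by all remaining variables.
r2_from_others <- sapply(names(x), function(v) {
  fit <- lm(reformulate(setdiff(names(x), v), response = v), data = x)
  summary(fit)$r.squared
})

redundant <- names(r2_from_others)[r2_from_others > 0.9]
print(round(r2_from_others, 3))
print(redundant)
```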

Categorical and continuous variables will be summarized with counts and proportions or medians and quartiles, as appropriate, in a table stratified by sex and age groups.
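A stratified summary of this kind can be sketched in base R as follows. The data are simulated and the age group boundaries (30, 50, 70, 85) are arbitrary choices for illustration; the report may use different groups.

```r
# Simulated participants (invented values).
set.seed(9)
n <- 200
toy <- data.frame(
  age    = runif(n, 30, 85),
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  bmi    = rnorm(n, 28, 5)
)

# Assumed age groups: [30,50), [50,70), [70,85].
toy$age_group <- cut(toy$age, breaks = c(30, 50, 70, 85),
                     include.lowest = TRUE, right = FALSE)

# Counts per stratum, and median with quartiles of a continuous variable.
counts <- table(toy$gender, toy$age_group)
q <- aggregate(bmi ~ gender + age_group, data = toy,
               FUN = function(v) quantile(v, c(.25, .50, .75)))
print(counts)
print(q)
```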

Scatterplots of continuous variables by age will be constructed stratified by gender.

7. If interactions between covariates were prespecified to be included in the regression model, special attention should be given to evaluate the bivariate distribution of the interacting covariates. Appropriate graphical displays (see 6) should be used to visualise these distributions.

Interactions between age and gender will be considered. The distribution of age will be depicted as histogram stratified for gender.

8. For a derived outcome variable, the bivariate distributions of the underlying variables with the outcome variable should be evaluated with appropriate visualizations.

Scatter plots of the physical activity variables with MVPA will be constructed with trend lines.

Missing data

Per variable missingness

Number and percentage of missing.

Variable Missing (count) Missing (%)
tac 708 10.60
tlac 708 10.60
mvpa 708 10.60
wt 708 10.60
tlac.1 708 10.60
tlac.2 708 10.60
tlac.3 708 10.60
tlac.4 708 10.60
tlac.5 708 10.60
tlac.6 708 10.60
tlac.7 708 10.60
tlac.8 708 10.60
tlac.9 708 10.60
tlac.10 708 10.60
tlac.11 708 10.60
tlac.12 708 10.60
sys 320 4.79
lbxtc 270 4.04
lbdhdd 270 4.04
bmi 56 0.84
permth.exm 9 0.13
mortstat 9 0.13
educationadult 7 0.10
smokecigs 4 0.06
age 0 0.00
gender 0 0.00
drinkstatus 0 0.00
diabetes 0 0.00
chf 0 0.00
cancer 0 0.00
stroke 0 0.00
mobilityproblem 0 0.00

Variable summaries for complete vs incomplete cases

Participant characteristics by missing status of MVPA
incomplete (N=708) complete (N=5972) Total (N=6680) p value
age < 0.001
   Median 48.375 53.750 53.167
   Q1, Q3 38.583, 64.271 41.646, 67.250 41.333, 67.000
   Range 30.000 - 84.917 30.000 - 84.917 30.000 - 84.917
gender 0.432
   Male 359 (50.7%) 2935 (49.1%) 3294 (49.3%)
   Female 349 (49.3%) 3037 (50.9%) 3386 (50.7%)
education level 0.072
   N-Miss 3 4 7
   Less than high school 216 (30.6%) 1683 (28.2%) 1899 (28.5%)
   High school 186 (26.4%) 1448 (24.3%) 1634 (24.5%)
   More than high school 303 (43.0%) 2837 (47.5%) 3140 (47.1%)
body mass index 0.796
   Median 28.400 28.080 28.100
   Q1, Q3 24.373, 32.353 24.730, 32.230 24.718, 32.250
   Range 16.570 - 72.280 13.360 - 130.210 13.360 - 130.210
smoking status 0.051
   N-Miss 2 2 4
   Never 342 (48.4%) 2911 (48.8%) 3253 (48.7%)
   Former 185 (26.2%) 1759 (29.5%) 1944 (29.1%)
   Current 179 (25.4%) 1300 (21.8%) 1479 (22.2%)
alcohol consumption 0.008
   Moderate Drinker 359 (50.7%) 3090 (51.7%) 3449 (51.6%)
   Non-Drinker 238 (33.6%) 2098 (35.1%) 2336 (35.0%)
   Heavy Drinker 40 (5.6%) 389 (6.5%) 429 (6.4%)
   Missing alcohol 71 (10.0%) 395 (6.6%) 466 (7.0%)
Final mortality status 0.205
   Median 0.000 0.000 0.000
   Q1, Q3 0.000, 0.000 0.000, 0.000 0.000, 0.000
   Range 0.000 - 1.000 0.000 - 1.000 0.000 - 1.000
diabetes 0.659
   No 614 (86.7%) 5214 (87.3%) 5828 (87.2%)
   Yes 94 (13.3%) 758 (12.7%) 852 (12.8%)
congestive heart failure 0.538
   No 677 (95.6%) 5739 (96.1%) 6416 (96.0%)
   Yes 31 (4.4%) 233 (3.9%) 264 (4.0%)
cancer 0.106
   No 649 (91.7%) 5359 (89.7%) 6008 (89.9%)
   Yes 59 (8.3%) 613 (10.3%) 672 (10.1%)
stroke 0.163
   No 672 (94.9%) 5734 (96.0%) 6406 (95.9%)
   Yes 36 (5.1%) 238 (4.0%) 274 (4.1%)
Systolic blood pressure 0.090
   Median 123.000 124.000 124.000
   Q1, Q3 113.000, 135.000 113.000, 138.000 113.000, 138.000
   Range 82.000 - 228.000 73.000 - 270.000 73.000 - 270.000
Total cholesterol 0.394
   Median 199.000 201.000 201.000
   Q1, Q3 172.000, 229.000 175.000, 228.000 175.000, 229.000
   Range 99.000 - 704.000 82.000 - 650.000 82.000 - 704.000
HDL cholesterol 0.260
   Median 50.000 52.000 52.000
   Q1, Q3 42.000, 63.000 43.000, 64.000 42.000, 63.000
   Range 17.000 - 164.000 17.000 - 188.000 17.000 - 188.000
difficulties with mobility 0.817
   No Difficulty 542 (76.6%) 4595 (76.9%) 5137 (76.9%)
   Any Difficulty 166 (23.4%) 1377 (23.1%) 1543 (23.1%)
total log activity count (log(1+activity))
   Median NA 2910.926 2910.926
   Q1, Q3 NA 2384.757, 3430.648 2384.757, 3430.648
   Range NA 313.083 - 6122.678 313.083 - 6122.678
total accelerometer wear time
   Median NA 852.071 852.071
   Q1, Q3 NA 782.851, 922.036 782.851, 922.036
   Range NA 600.000 - 1440.000 600.000 - 1440.000

The 708 participants without physical activity data form the incomplete group, consistent with the per-variable missingness counts and with the NA activity summaries in that column. It appears that participants with incomplete physical activity data are younger than those with complete data (median age 48.4 vs 53.8 years).

Missingness patterns over variables

(In)complete cases

This section presents participants with at least one missing value. First, we list these participants in a filterable table.

Then we report the pattern of missingness for this set of participants.

Section session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] VIM_6.1.0        colorspace_2.0-0 arsenal_3.6.1    DT_0.17         
##  [5] kableExtra_1.3.1 gt_0.2.2         naniar_0.6.0     Hmisc_4.5-0     
##  [9] Formula_1.2-4    survival_3.1-12  lattice_0.20-41  forcats_0.5.1   
## [13] stringr_1.4.0    dplyr_1.0.4      purrr_0.3.4      readr_1.4.0     
## [17] tidyr_1.1.2      tibble_3.0.6     ggplot2_3.3.3    tidyverse_1.3.0 
## [21] here_1.0.1      
## 
## loaded via a namespace (and not attached):
##  [1] ellipsis_0.3.1      class_7.3-17        rio_0.5.16         
##  [4] visdat_0.5.3        rprojroot_2.0.2     htmlTable_2.1.0    
##  [7] base64enc_0.1-3     fs_1.5.0            rstudioapi_0.13    
## [10] farver_2.0.3        lubridate_1.7.9.2   ranger_0.12.1      
## [13] xml2_1.3.2          splines_4.0.2       robustbase_0.93-7  
## [16] knitr_1.31          jsonlite_1.7.2      broom_0.7.4        
## [19] cluster_2.1.0       dbplyr_2.1.0        png_0.1-7          
## [22] compiler_4.0.2      httr_1.4.2          backports_1.2.1    
## [25] assertthat_0.2.1    Matrix_1.2-18       cli_2.3.0          
## [28] htmltools_0.5.1.1   tools_4.0.2         gtable_0.3.0       
## [31] glue_1.4.2          Rcpp_1.0.6          carData_3.0-4      
## [34] cellranger_1.1.0    vctrs_0.3.6         crosstalk_1.1.1    
## [37] lmtest_0.9-38       xfun_0.20           laeken_0.5.1       
## [40] openxlsx_4.2.3      rvest_0.3.6         lifecycle_0.2.0    
## [43] DEoptimR_1.0-8      MASS_7.3-51.6       zoo_1.8-8          
## [46] scales_1.1.1        hms_1.0.0           RColorBrewer_1.1-2 
## [49] yaml_2.2.1          curl_4.3            gridExtra_2.3      
## [52] UpSetR_1.4.0        sass_0.3.1          rpart_4.1-15       
## [55] latticeExtra_0.6-29 stringi_1.5.3       highr_0.8          
## [58] e1071_1.7-4         checkmate_2.0.0     boot_1.3-25        
## [61] zip_2.1.1           rlang_0.4.10        pkgconfig_2.0.3    
## [64] commonmark_1.7      evaluate_0.14       htmlwidgets_1.5.3  
## [67] labeling_0.4.2      tidyselect_1.1.0    plyr_1.8.6         
## [70] magrittr_2.0.1      bookdown_0.21       R6_2.5.0           
## [73] generics_0.1.0      DBI_1.1.1           pillar_1.4.7       
## [76] haven_2.3.1         foreign_0.8-80      withr_2.4.1        
## [79] abind_1.4-5         sp_1.4-5            nnet_7.3-14        
## [82] modelr_0.1.8        crayon_1.4.1        car_3.0-10         
## [85] rmarkdown_2.6       jpeg_0.1-8.1        readxl_1.3.1       
## [88] data.table_1.13.6   rmdformats_1.0.1    vcd_1.4-8          
## [91] reprex_1.0.0        digest_0.6.27       webshot_0.5.2      
## [94] munsell_0.5.0       viridisLite_0.3.0

Univariate distribution checks

This section reports a series of univariate summary checks of the NHANES dataset.

## Rows: 6,680
## Columns: 33
## $ seqn            <labelled> 21009, 21010, 21012, 21015, 21017, 21018, 2101...
## $ age             <labelled> 56.00000, 52.83333, 63.83333, 83.91667, 37.083...
## $ gender          <fct> Male, Female, Male, Male, Female, Female, Female, F...
## $ permth.exm      <labelled> 135, 149, 127, 24, 151, 154, 153, 154, 141, 14...
## $ mortstat        <labelled> 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0...
## $ educationadult  <fct> High school, More than high school, High school, Mo...
## $ smokecigs       <fct> Never, Current, Current, Former, Current, Never, Fo...
## $ drinkstatus     <fct> Non-Drinker, Heavy Drinker, Missing alcohol, Non-Dr...
## $ bmi             <labelled> 31.26, 25.49, 19.60, 28.32, 19.34, 16.57, 38.0...
## $ diabetes        <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes...
## $ chf             <fct> No, No, No, No, No, No, Yes, No, No, No, No, No, No...
## $ cancer          <fct> No, No, No, Yes, No, No, No, No, No, No, Yes, No, N...
## $ stroke          <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
## $ sys             <labelled> 120, 133, 123, 154, 103, 137, 115, 131, 121, 1...
## $ lbxtc           <labelled> 254, 174, 191, 141, 184, NA, 173, 230, 261, 21...
## $ lbdhdd          <labelled> 37, 119, 92, 34, 77, NA, 45, 51, 29, 68, 53, 4...
## $ mobilityproblem <fct> No Difficulty, No Difficulty, Any Difficulty, Any D...
## $ tac             <labelled> 409352.71, 286407.71, 130778.29, 102562.86, 41...
## $ tlac            <labelled> 3522.427, 3334.503, 2749.086, 2103.580, 3689.4...
## $ mvpa            <labelled> 48.285714, 9.428571, 4.714286, 3.000000, 58.83...
## $ wt              <labelled> 900.2857, 783.4286, 1053.0000, 813.1429, 833.8...
## $ tlac.1          <labelled> 0.0000000, 0.0000000, 161.7450224, 5.6786074, ...
## $ tlac.2          <labelled> 0.000000, 0.000000, 128.091725, 7.244960, 0.00...
## $ tlac.3          <labelled> 66.563485, 0.000000, 145.091848, 8.942295, 0.0...
## $ tlac.4          <labelled> 476.33325, 0.00000, 74.34726, 18.38650, 459.83...
## $ tlac.5          <labelled> 612.21257, 358.95003, 152.73994, 106.23999, 57...
## $ tlac.6          <labelled> 586.1977, 449.0983, 249.8184, 286.1406, 571.73...
## $ tlac.7          <labelled> 462.8831, 514.9402, 352.7644, 393.9136, 637.96...
## $ tlac.8          <labelled> 587.5167, 550.7981, 277.3621, 321.9635, 634.96...
## $ tlac.9          <labelled> 315.9585, 487.6278, 302.4908, 327.1102, 254.62...
## $ tlac.10         <labelled> 251.4170, 527.7868, 310.0226, 313.5571, 338.45...
## $ tlac.11         <labelled> 159.78640, 401.07109, 312.24852, 254.92505, 19...
## $ tlac.12         <labelled> 3.558282, 44.230766, 282.363630, 59.477670, 26...

Data set overview

Using the Hmisc describe function, we provide an overview of the data set. The descriptive report also provides histograms of continuous variables. For ease of scanning the information, we group the report by measurement type.

Demographic and lifestyle variables

6 Variables   5972 Observations

age years
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0      660        1    54.87    17.64    32.17    34.58    41.65    53.75    67.25    76.50    80.83
lowest : 30.00000 30.08333 30.16667 30.25000 30.33333 , highest: 84.58333 84.66667 84.75000 84.83333 84.91667

gender
        n  missing distinct
     5972        0        2
 Value        Male Female
 Frequency    2935   3037
 Proportion  0.491  0.509

educationadult: education level
        n  missing distinct
     5968        4        3
 Value      Less than high school           High school More than high school
 Frequency                   1683                  1448                  2837
 Proportion                 0.282                 0.243                 0.475

bmi: body mass index kg/m2
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5928       44     2161        1    29.07    6.804    20.78    22.12    24.73    28.08    32.23    37.07    40.80
lowest : 13.36 14.65 14.70 15.91 15.92 , highest: 62.50 62.77 63.42 63.87 130.21

smokecigs: smoking status
        n  missing distinct
     5970        2        3
 Value        Never  Former Current
 Frequency     2911    1759    1300
 Proportion   0.488   0.295   0.218

drinkstatus: alcohol consumption
        n  missing distinct
     5972        0        4
 Value      Moderate Drinker      Non-Drinker    Heavy Drinker  Missing alcohol
 Frequency              3090             2098              389              395
 Proportion            0.517            0.351            0.065            0.066
 

Physiological measurements

3 Variables   5972 Observations

sys: Systolic blood pressure mmHg
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5698      274      137        1    127.4    22.26    100.0    105.0    113.0    124.0    138.0    154.0    166.1
lowest : 73 80 81 83 85 , highest: 226 230 238 256 270

lbxtc: Total cholesterol mg/dL
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5742      230      264        1    204.1     46.4      143      155      175      201      228      258      277
lowest : 82 83 85 92 94 , highest: 431 440 458 539 650

lbdhdd: HDL cholesterol mg/dL
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5742      230      109        1    54.64    17.91       33       37       43       52       64       76       85
lowest : 17 22 23 24 25 , highest: 146 151 152 154 188

Comorbidities

5 Variables   5972 Observations

mortstat: Final mortality status
        n  missing distinct     Info      Sum     Mean      Gmd
     5964        8        2    0.441     1068   0.1791   0.2941

diabetes
        n  missing distinct
     5972        0        2
 Value         No   Yes
 Frequency   5214   758
 Proportion 0.873 0.127

chf: congestive heart failure
        n  missing distinct
     5972        0        2
 Value         No   Yes
 Frequency   5739   233
 Proportion 0.961 0.039

cancer
        n  missing distinct
     5972        0        2
 Value         No   Yes
 Frequency   5359   613
 Proportion 0.897 0.103

stroke
        n  missing distinct
     5972        0        2
 Value        No  Yes
 Frequency  5734  238
 Proportion 0.96 0.04
 

Physical activity variables

Physical activity

17 Variables   5972 Observations

mobilityproblem: difficulties with mobility
        n  missing distinct
     5972        0        2
 Value       No Difficulty Any Difficulty
 Frequency            4595           1377
 Proportion          0.769          0.231

tac: total activity counts per day
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5965        1   244811   143738    69233    94872   150571   223572   314224   417410   486450
lowest : 8263.000 8931.833 12123.000 14642.000 15656.000
highest: 981517.167 986261.000 986593.800 1097823.500 1122542.600

tlac: total log activity count (log(1+activity))
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5969        1     2900    873.5     1613     1900     2385     2911     3431     3877     4164
lowest : 313.0835 364.4561 400.8157 429.9288 466.0362
highest: 5436.1548 5492.5395 5588.3401 5655.4680 6122.6779

mvpa: Moderate or vigorous physical activity minutes
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     1163        1    19.19     20.9    0.800    1.429    4.000   12.000   26.762   46.000   59.921
lowest : 0.0000000 0.1428571 0.1666667 0.2000000 0.2500000
highest: 180.8333333 186.2000000 194.8000000 208.5000000 249.0000000

wt: total accelerometer wear time minutes
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     3613        1    866.1    139.8    684.3    721.0    782.9    852.1    922.0   1000.6   1111.5
lowest : 600.000 601.500 602.000 603.000 604.000 , highest: 1425.286 1426.250 1426.286 1426.857 1440.000

tlac.1: total log activity count 12:00AM-2:00AM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     2656    0.829    30.92    51.83     0.00     0.00     0.00     0.00    24.38    94.43   169.25
lowest : 0.0000000 0.1569446 0.1831020 0.2299197 0.2559656
highest: 597.3808309 620.0469233 674.1677375 709.3300116 719.0239316

tlac.2: total log activity count 2:00AM-4:00AM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     1770    0.653    19.09    34.47     0.00     0.00     0.00     0.00     2.91    51.83   110.64
lowest : 0.00000000 0.09902103 0.11552453 0.15694461 0.23104906
highest: 586.34967162 611.00545824 617.44773130 737.25383394 775.42871350

tlac.3: total log activity count 4:00AM-6:00AM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     2834    0.855    43.29    70.78     0.00     0.00     0.00     0.00    38.74   147.59   248.43
lowest : 0.0000000 0.1155245 0.1386294 0.2299197 0.2682397
highest: 679.1484297 697.1093552 704.5766819 719.3198459 769.6014301

tlac.4: total log activity count 6:00AM-8:00AM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5285    0.998      177    178.6     0.00     0.00    36.94   137.34   282.09   416.35   496.25
lowest : 0.0000000 0.2299197 0.3465736 0.6148132 0.6839274
highest: 774.8811640 792.6938042 822.1482092 832.9933042 857.9018816

tlac.5: total log activity count 8:00AM-10:00AM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5834        1    339.3    191.7    39.52   102.56   221.28   346.74   460.18   552.17   610.19
lowest : 0.0000000 0.2310491 0.7250248 0.8652549 1.0357837
highest: 812.0225306 812.8675420 813.2942210 824.5800445 888.1759271

tlac.6: total log activity count 10:00AM-12:00PM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5931        1    407.7    163.6    150.4    218.6    316.2    415.0    506.7    589.9    634.9
lowest : 0.0000000 0.6986213 2.6001909 4.5903937 5.7234361
highest: 807.7712473 808.7247458 811.5701740 884.1169241 892.0314653

tlac.7: total log activity count 12:00PM-2:00PM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5947        1      418    146.9    192.1    250.4    337.6    423.5    507.2    581.3    623.7
lowest : 0.000000 1.734669 2.704424 5.605670 6.387910
highest: 788.370472 796.082067 813.380498 821.733575 885.445891

tlac.8: total log activity count 2:00PM-4:00PM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5954        1    411.7    147.8    192.1    243.1    323.6    414.3    501.7    577.5    619.9
lowest : 0.000000 1.974752 3.096473 4.094345 5.772020
highest: 792.683985 837.042353 846.553847 877.212734 904.872351

tlac.9: total log activity count 4:00PM-6:00PM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5955        1      397    140.3    185.4    234.8    316.4    401.8    483.5    553.6    591.4
lowest : 0.000000 2.957040 3.401197 4.148165 5.084134
highest: 771.497952 783.128869 801.039991 809.429425 822.294800

tlac.10: total log activity count 6:00PM-8:00PM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5932        1    337.6    151.3    114.1    165.5    246.6    339.5    433.1    504.4    548.9
lowest : 0.000000 1.311822 1.353699 1.753975 3.459493
highest: 778.168243 778.774433 802.020060 851.421446 860.123328

tlac.11: total log activity count 8:00PM-10:00PM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     5786        1    223.2    158.2    10.22    42.32   116.77   212.72   315.90   411.75   471.84
lowest : 0.0000000 0.6229449 0.6708919 1.0233141 1.0525597
highest: 724.9040071 753.8848070 821.4989318 826.3463412 839.8942777

tlac.12: total log activity count 10:00PM-12:00AM
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25      .50      .75      .90      .95
     5972        0     4943    0.995    95.37    114.3    0.000    0.000    6.693   55.438  141.863  251.308  328.945
lowest : 0.00000000 0.09902103 0.17328680 0.27798716 0.41291025
highest: 683.58618305 698.46723961 702.66304648 707.15487443 733.61717206

Categorical variables

We now provide a closer visual examination of the categorical predictors.

Continuous variables

A closer visual examination of continuous predictors and the outcome variable.

There is evidence of potentially influential points in some of the distributions. This is explored further with targeted summaries. More detailed univariate summaries for the variables of interest are also provided below.

Age

## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_bar).
Distribution of age

Blood pressure

Distribution of SBP

Body mass index

Distribution of body mass index

There is a participant with an unusually high value (130.2). It is possible that this is a data entry error (e.g. a true BMI of 30.2).
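A simple way to flag such values for review is sketched below. The rule of three interquartile ranges beyond the quartiles is an assumption introduced here, not a rule stated in the analysis plan, and the vector holds only a handful of BMI values that appear in the output above.

```r
# A few BMI values taken from the output above, including the maximum.
bmi <- c(19.60, 25.49, 28.32, 31.26, 38.00, 130.21)

# Assumed screening rule: more than 3 IQRs below Q1 or above Q3.
q <- quantile(bmi, c(.25, .75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]
fence_lo <- q[[1]] - 3 * iqr
fence_hi <- q[[2]] + 3 * iqr

flagged <- bmi[!is.na(bmi) & (bmi < fence_lo | bmi > fence_hi)]
print(flagged)
```

Flagged values are candidates for checking against the source records; whether to exclude or correct them is a decision to document in the IDA report.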

Total cholesterol

Distribution of total cholesterol

Distribution of HDL

Section session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] Hmisc_4.5-0     Formula_1.2-4   survival_3.1-12 lattice_0.20-41
##  [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.4     purrr_0.3.4    
##  [9] readr_1.4.0     tidyr_1.1.2     tibble_3.0.6    ggplot2_3.3.3  
## [13] tidyverse_1.3.0 here_1.0.1     
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2          jsonlite_1.7.2      splines_4.0.2      
##  [4] modelr_0.1.8        assertthat_0.2.1    highr_0.8          
##  [7] latticeExtra_0.6-29 cellranger_1.1.0    yaml_2.2.1         
## [10] pillar_1.4.7        backports_1.2.1     glue_1.4.2         
## [13] digest_0.6.27       RColorBrewer_1.1-2  checkmate_2.0.0    
## [16] rvest_0.3.6         colorspace_2.0-0    htmltools_0.5.1.1  
## [19] Matrix_1.2-18       pkgconfig_2.0.3     broom_0.7.4        
## [22] haven_2.3.1         bookdown_0.21       patchwork_1.1.1    
## [25] scales_1.1.1        jpeg_0.1-8.1        htmlTable_2.1.0    
## [28] generics_0.1.0      farver_2.0.3        ellipsis_0.3.1     
## [31] withr_2.4.1         nnet_7.3-14         cli_2.3.0          
## [34] magrittr_2.0.1      crayon_1.4.1        readxl_1.3.1       
## [37] evaluate_0.14       fs_1.5.0            fansi_0.4.2        
## [40] xml2_1.3.2          foreign_0.8-80      tools_4.0.2        
## [43] data.table_1.13.6   hms_1.0.0           lifecycle_0.2.0    
## [46] munsell_0.5.0       reprex_1.0.0        cluster_2.1.0      
## [49] compiler_4.0.2      rlang_0.4.10        grid_4.0.2         
## [52] rstudioapi_0.13     htmlwidgets_1.5.3   base64enc_0.1-3    
## [55] labeling_0.4.2      rmarkdown_2.6       gtable_0.3.0       
## [58] DBI_1.1.1           R6_2.5.0            gridExtra_2.3      
## [61] lubridate_1.7.9.2   knitr_1.31          utf8_1.1.4         
## [64] rprojroot_2.0.2     stringi_1.5.3       rmdformats_1.0.1   
## [67] Rcpp_1.0.6          vctrs_0.3.6         rpart_4.1-15       
## [70] png_0.1-7           dbplyr_2.1.0        tidyselect_1.1.0   
## [73] xfun_0.20

Multivariate distributions

This section reports a series of multivariate summaries of the NHANES dataset.

Overview

Variable correlation

Correlations of the physical activity variables (outcome)

Variable clustering

Variable clustering is used for assessing collinearity, redundancy, and for separating variables into clusters that can be scored as a single variable, thus resulting in data reduction.
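The clustering idea can be illustrated with a base R sketch that uses squared Spearman correlations as similarities, the measure shown in the output that follows. The data are simulated, only continuous variables are handled, and complete linkage on 1 minus the similarity is an assumption for this sketch rather than a statement about the settings used in the report.

```r
# Simulated continuous covariates with two related pairs (invented values).
set.seed(10)
n <- 300
x <- data.frame(age = runif(n, 30, 85), bmi = rnorm(n, 28, 5))
x$sys    <- 100 + 0.5 * x$age + rnorm(n, sd = 15)
x$lbdhdd <- 70 - 1.5 * x$bmi + rnorm(n, sd = 8)

# Similarity: squared Spearman rank correlation between each pair of variables.
sim <- cor(x, method = "spearman", use = "pairwise.complete.obs")^2

# Hierarchical clustering on the dissimilarity 1 - similarity.
hc <- hclust(as.dist(1 - sim), method = "complete")
clusters <- cutree(hc, k = 2)
print(round(sim, 2))
print(clusters)

if (interactive()) plot(hc, main = "Variable clustering (Spearman rho^2)")
```

Variables that join at a low height share much of their rank information and are candidates for being summarized by a single score.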

## Hmisc::varclus(x = ~age + gender + bmi + sys + lbxtc + lbdhdd + 
##     smokecigs + drinkstatus + mortstat + diabetes + chf + cancer + 
##     stroke, data = a_nhanes)
## 
## 
## Similarity matrix (Spearman rho^2)
## 
##                             age genderFemale  bmi  sys lbxtc lbdhdd
## age                        1.00         0.00 0.00 0.19  0.00   0.00
## genderFemale               0.00         1.00 0.00 0.00  0.01   0.13
## bmi                        0.00         0.00 1.00 0.02  0.00   0.08
## sys                        0.19         0.00 0.02 1.00  0.00   0.00
## lbxtc                      0.00         0.01 0.00 0.00  1.00   0.02
## lbdhdd                     0.00         0.13 0.08 0.00  0.02   1.00
## smokecigsFormer            0.07         0.02 0.00 0.01  0.00   0.00
## smokecigsCurrent           0.02         0.01 0.01 0.00  0.00   0.01
## drinkstatusNon-Drinker     0.05         0.01 0.01 0.01  0.00   0.01
## drinkstatusHeavy Drinker   0.00         0.01 0.01 0.00  0.00   0.01
## drinkstatusMissing alcohol 0.01         0.00 0.00 0.00  0.00   0.00
## mortstat                   0.17         0.01 0.00 0.04  0.01   0.00
## diabetesYes                0.04         0.00 0.02 0.01  0.01   0.01
## chfYes                     0.03         0.00 0.00 0.00  0.01   0.00
## cancerYes                  0.06         0.00 0.00 0.00  0.00   0.00
## strokeYes                  0.03         0.00 0.00 0.01  0.00   0.00
##                            smokecigsFormer smokecigsCurrent
## age                                   0.07             0.02
## genderFemale                          0.02             0.01
## bmi                                   0.00             0.01
## sys                                   0.01             0.00
## lbxtc                                 0.00             0.00
## lbdhdd                                0.00             0.01
## smokecigsFormer                       1.00             0.12
## smokecigsCurrent                      0.12             1.00
## drinkstatusNon-Drinker                0.00             0.02
## drinkstatusHeavy Drinker              0.00             0.03
## drinkstatusMissing alcohol            0.00             0.00
## mortstat                              0.02             0.00
## diabetesYes                           0.00             0.00
## chfYes                                0.01             0.00
## cancerYes                             0.01             0.00
## strokeYes                             0.00             0.00
##                            drinkstatusNon-Drinker drinkstatusHeavy Drinker
## age                                          0.05                     0.00
## genderFemale                                 0.01                     0.01
## bmi                                          0.01                     0.01
## sys                                          0.01                     0.00
## lbxtc                                        0.00                     0.00
## lbdhdd                                       0.01                     0.01
## smokecigsFormer                              0.00                     0.00
## smokecigsCurrent                             0.02                     0.03
## drinkstatusNon-Drinker                       1.00                     0.04
## drinkstatusHeavy Drinker                     0.04                     1.00
## drinkstatusMissing alcohol                   0.04                     0.01
## mortstat                                     0.02                     0.00
## diabetesYes                                  0.02                     0.00
## chfYes                                       0.01                     0.00
## cancerYes                                    0.00                     0.00
## strokeYes                                    0.01                     0.00
##                            drinkstatusMissing alcohol mortstat diabetesYes
## age                                              0.01     0.17        0.04
## genderFemale                                     0.00     0.01        0.00
## bmi                                              0.00     0.00        0.02
## sys                                              0.00     0.04        0.01
## lbxtc                                            0.00     0.01        0.01
## lbdhdd                                           0.00     0.00        0.01
## smokecigsFormer                                  0.00     0.02        0.00
## smokecigsCurrent                                 0.00     0.00        0.00
## drinkstatusNon-Drinker                           0.04     0.02        0.02
## drinkstatusHeavy Drinker                         0.01     0.00        0.00
## drinkstatusMissing alcohol                       1.00     0.00        0.00
## mortstat                                         0.00     1.00        0.03
## diabetesYes                                      0.00     0.03        1.00
## chfYes                                           0.00     0.04        0.03
## cancerYes                                        0.00     0.03        0.00
## strokeYes                                        0.00     0.03        0.02
##                            chfYes cancerYes strokeYes
## age                          0.03      0.06      0.03
## genderFemale                 0.00      0.00      0.00
## bmi                          0.00      0.00      0.00
## sys                          0.00      0.00      0.01
## lbxtc                        0.01      0.00      0.00
## lbdhdd                       0.00      0.00      0.00
## smokecigsFormer              0.01      0.01      0.00
## smokecigsCurrent             0.00      0.00      0.00
## drinkstatusNon-Drinker       0.01      0.00      0.01
## drinkstatusHeavy Drinker     0.00      0.00      0.00
## drinkstatusMissing alcohol   0.00      0.00      0.00
## mortstat                     0.04      0.03      0.03
## diabetesYes                  0.03      0.00      0.02
## chfYes                       1.00      0.00      0.02
## cancerYes                    0.00      1.00      0.00
## strokeYes                    0.02      0.00      1.00
## 
## No. of observations used for each pair:
## 
##                             age genderFemale  bmi  sys lbxtc lbdhdd
## age                        6680         6680 6624 6360  6410   6410
## genderFemale               6680         6680 6624 6360  6410   6410
## bmi                        6624         6624 6624 6316  6358   6358
## sys                        6360         6360 6316 6360  6133   6133
## lbxtc                      6410         6410 6358 6133  6410   6410
## lbdhdd                     6410         6410 6358 6133  6410   6410
## smokecigsFormer            6676         6676 6621 6356  6406   6406
## smokecigsCurrent           6676         6676 6621 6356  6406   6406
## drinkstatusNon-Drinker     6680         6680 6624 6360  6410   6410
## drinkstatusHeavy Drinker   6680         6680 6624 6360  6410   6410
## drinkstatusMissing alcohol 6680         6680 6624 6360  6410   6410
## mortstat                   6671         6671 6615 6351  6401   6401
## diabetesYes                6680         6680 6624 6360  6410   6410
## chfYes                     6680         6680 6624 6360  6410   6410
## cancerYes                  6680         6680 6624 6360  6410   6410
## strokeYes                  6680         6680 6624 6360  6410   6410
##                            smokecigsFormer smokecigsCurrent
## age                                   6676             6676
## genderFemale                          6676             6676
## bmi                                   6621             6621
## sys                                   6356             6356
## lbxtc                                 6406             6406
## lbdhdd                                6406             6406
## smokecigsFormer                       6676             6676
## smokecigsCurrent                      6676             6676
## drinkstatusNon-Drinker                6676             6676
## drinkstatusHeavy Drinker              6676             6676
## drinkstatusMissing alcohol            6676             6676
## mortstat                              6667             6667
## diabetesYes                           6676             6676
## chfYes                                6676             6676
## cancerYes                             6676             6676
## strokeYes                             6676             6676
##                            drinkstatusNon-Drinker drinkstatusHeavy Drinker
## age                                          6680                     6680
## genderFemale                                 6680                     6680
## bmi                                          6624                     6624
## sys                                          6360                     6360
## lbxtc                                        6410                     6410
## lbdhdd                                       6410                     6410
## smokecigsFormer                              6676                     6676
## smokecigsCurrent                             6676                     6676
## drinkstatusNon-Drinker                       6680                     6680
## drinkstatusHeavy Drinker                     6680                     6680
## drinkstatusMissing alcohol                   6680                     6680
## mortstat                                     6671                     6671
## diabetesYes                                  6680                     6680
## chfYes                                       6680                     6680
## cancerYes                                    6680                     6680
## strokeYes                                    6680                     6680
##                            drinkstatusMissing alcohol mortstat diabetesYes
## age                                              6680     6671        6680
## genderFemale                                     6680     6671        6680
## bmi                                              6624     6615        6624
## sys                                              6360     6351        6360
## lbxtc                                            6410     6401        6410
## lbdhdd                                           6410     6401        6410
## smokecigsFormer                                  6676     6667        6676
## smokecigsCurrent                                 6676     6667        6676
## drinkstatusNon-Drinker                           6680     6671        6680
## drinkstatusHeavy Drinker                         6680     6671        6680
## drinkstatusMissing alcohol                       6680     6671        6680
## mortstat                                         6671     6671        6671
## diabetesYes                                      6680     6671        6680
## chfYes                                           6680     6671        6680
## cancerYes                                        6680     6671        6680
## strokeYes                                        6680     6671        6680
##                            chfYes cancerYes strokeYes
## age                          6680      6680      6680
## genderFemale                 6680      6680      6680
## bmi                          6624      6624      6624
## sys                          6360      6360      6360
## lbxtc                        6410      6410      6410
## lbdhdd                       6410      6410      6410
## smokecigsFormer              6676      6676      6676
## smokecigsCurrent             6676      6676      6676
## drinkstatusNon-Drinker       6680      6680      6680
## drinkstatusHeavy Drinker     6680      6680      6680
## drinkstatusMissing alcohol   6680      6680      6680
## mortstat                     6671      6671      6671
## diabetesYes                  6680      6680      6680
## chfYes                       6680      6680      6680
## cancerYes                    6680      6680      6680
## strokeYes                    6680      6680      6680
## 
## hclust results (method=complete)
## 
## 
## Call:
## hclust(d = as.dist(1 - x), method = method)
## 
## Cluster method   : complete 
## Number of objects: 16
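
The similarity matrix, the pairwise observation counts, and the clustering reported above can be approximated in base R. The following is a minimal sketch, not the code used for this report: the data frame `toy` and its values are simulated for illustration, and `Hmisc::varclus` additionally expands factors into indicator variables, which is not reproduced here.

```r
# Minimal base-R sketch of variable clustering on squared Spearman correlations.
# `toy` is simulated, hypothetical data; it is NOT the NHANES data set.
set.seed(1)
n <- 200
age <- runif(n, 30, 85)
toy <- data.frame(
  age   = age,
  sys   = 100 + 0.5 * age + rnorm(n, sd = 15),  # constructed to correlate with age
  bmi   = rnorm(n, mean = 28, sd = 6),
  lbxtc = rnorm(n, mean = 200, sd = 40)
)
toy$bmi[sample(n, 10)] <- NA  # a few missing values, handled pairwise below

# Similarity: squared Spearman rank correlation on pairwise-complete observations
sim <- cor(toy, method = "spearman", use = "pairwise.complete.obs")^2

# Number of observations available for each pair of variables
obs <- (!is.na(as.matrix(toy))) + 0  # 1 = observed, 0 = missing
n_pair <- crossprod(obs)

# Complete-linkage hierarchical clustering on the dissimilarity 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")

round(sim, 2)
n_pair
cutree(hc, k = 2)  # cluster membership when the tree is cut into two groups
```

Using `1 - rho^2` as the dissimilarity means that strongly positively and strongly negatively associated variables are treated alike, which is usually what is wanted when screening for collinearity.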

Plot associations.

Variable redundancy

Redundancy analysis of predictor variables.

## 
## Redundancy Analysis
## 
## Hmisc::redun(formula = ~age + gender + bmi + sys + lbxtc + lbdhdd + 
##     smokecigs + drinkstatus + mortstat + diabetes + chf + cancer + 
##     stroke, data = a_nhanes)
## 
## n: 6080  p: 13   nk: 3 
## 
## Number of NAs:    600 
## Frequencies of Missing Values Due to Each Variable
##         age      gender         bmi         sys       lbxtc      lbdhdd 
##           0           0          56         320         270         270 
##   smokecigs drinkstatus    mortstat    diabetes         chf      cancer 
##           4           0           9           0           0           0 
##      stroke 
##           0 
## 
## 
## Transformation of target variables forced to be linear
## 
## R-squared cutoff: 0.9    Type: ordinary 
## 
## R^2 with which each variable can be predicted from all other variables:
## 
##         age      gender         bmi         sys       lbxtc      lbdhdd 
##       0.417       0.222       0.156       0.207       0.057       0.274 
##   smokecigs drinkstatus    mortstat    diabetes         chf      cancer 
##       0.116       0.142       0.282       0.110       0.091       0.080 
##      stroke 
##       0.062 
## 
## No redundant variables
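
The idea behind the R^2 values above can be illustrated with a simplified sketch: regress each variable on all the others and compare the resulting R^2 with the cutoff. This is a linear-only approximation on simulated data (`toy`, `x1` to `x4` are hypothetical names). `Hmisc::redun` itself uses flexible spline expansions of the predictors and removes variables stepwise, so its numbers will generally differ.

```r
# Simplified redundancy check: linear R^2 of each variable given all others.
# `toy` is simulated, hypothetical data; it is NOT the NHANES data set.
set.seed(2)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x2,
  x3 = x1 + x2 + rnorm(n, sd = 0.1),  # nearly determined by x1 and x2
  x4 = rnorm(n)                       # unrelated to the other variables
)
toy <- na.omit(toy)  # complete cases only

# For each variable, fit lm(variable ~ all other variables) and keep R^2
r2 <- sapply(names(toy), function(v) {
  fit <- lm(reformulate(setdiff(names(toy), v), response = v), data = toy)
  summary(fit)$r.squared
})

cutoff <- 0.9            # same R-squared cutoff as in the report above
round(r2, 3)
names(r2)[r2 > cutoff]   # variables that would be flagged as candidates
```

In this toy example `x1`, `x2`, and `x3` predict one another almost perfectly, so all three exceed the cutoff. A stepwise procedure would drop only the most predictable one and then re-evaluate the rest.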

Summary reports by age and gender

Distribution of age by gender

Summary report by age group and gender

Summary report by gender

Baseline characteristics by gender.

| Variable | N | Male (N=3294) | Female (N=3386) |
|---|---|---|---|
| age, years | 6680 | 41.8 53.8 68.0; 55.1 ± 15.3 | 40.8 52.4 66.2; 54.0 ± 15.4 |
| body mass index, kg/m² | 6624 | 24.99 27.94 31.26; 28.58 ± 5.64 | 24.40 28.31 33.37; 29.55 ± 7.10 |
| education level : Less than high school | 6673 | 0.30 974/3289 | 0.27 925/3384 |
| High school | | 0.24 798/3289 | 0.25 836/3384 |
| More than high school | | 0.46 1517/3289 | 0.48 1623/3384 |
| Systolic blood pressure, mmHg | 6360 | 115.0 125.0 137.0; 127.7 ± 18.2 | 111.0 123.0 139.0; 126.8 ± 22.4 |
| Total cholesterol, mg/dL | 6410 | 172.0 198.0 225.0; 200.6 ± 42.8 | 178.0 204.0 231.0; 207.1 ± 42.6 |
| HDL cholesterol, mg/dL | 6410 | 40.0 46.0 56.0; 49.0 ± 13.9 | 48.0 58.0 70.0; 60.1 ± 17.2 |
| smoking status : Never | 6676 | 0.38 1266/3292 | 0.59 1987/3384 |
| Former | | 0.35 1167/3292 | 0.23 777/3384 |
| Current | | 0.26 859/3292 | 0.18 620/3384 |
| alcohol consumption : Moderate Drinker | 6680 | 0.56 1846/3294 | 0.47 1603/3386 |
| Non-Drinker | | 0.29 964/3294 | 0.41 1372/3386 |
| Heavy Drinker | | 0.08 276/3294 | 0.05 153/3386 |
| Missing alcohol | | 0.06 208/3294 | 0.08 258/3386 |
| Final mortality status | 6671 | 0.21 692/3291 | 0.14 489/3380 |
| diabetes : Yes | 6680 | 0.13 425/3294 | 0.13 427/3386 |
| congestive heart failure : Yes | 6680 | 0.05 160/3294 | 0.03 104/3386 |
| cancer : Yes | 6680 | 0.09 306/3294 | 0.11 366/3386 |
| stroke : Yes | 6680 | 0.04 136/3294 | 0.04 138/3386 |

For continuous variables, cells show a b c; x ± s, where a is the lower quartile, b the median, and c the upper quartile, and x ± s is the mean ± 1 SD. For categorical variables, cells show the proportion followed by count/total. N is the number of non-missing values.
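
The continuous-variable summaries described in the table footnote (quartiles, mean ± SD, and the number of non-missing values per group) can be computed along the following lines. This is a hedged base-R sketch on simulated data: `toy` and `summarise_cont` are hypothetical names, and the quantile definition is R's default, which may differ slightly from the one used to build the tables in this report.

```r
# Sketch of the "a b c" (quartiles) and "x ± s" (mean ± SD) summaries by group.
# `toy` is simulated, hypothetical data; it is NOT the NHANES data set.
set.seed(3)
toy <- data.frame(
  gender = rep(c("Male", "Female"), each = 100),
  bmi    = c(rnorm(100, mean = 28.5, sd = 5.6), rnorm(100, mean = 29.5, sd = 7.1))
)
toy$bmi[c(5, 150)] <- NA  # one missing value in each group

# Summary of one continuous variable, ignoring missing values
summarise_cont <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
  c(n      = sum(!is.na(x)),
    q1     = q[[1]],
    median = q[[2]],
    q3     = q[[3]],
    mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE))
}

# One column per group, one row per statistic
res <- sapply(split(toy$bmi, toy$gender), summarise_cont)
round(res, 2)
```

Reporting both the quartiles and the mean ± SD side by side makes skewness visible: a mean well above the median, as for body mass index in the table above, points to a right-skewed distribution.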

Summary report by age group for men

Baseline characteristics by age group for men.

| Variable | N | 30-44 (N=1048) | 45-59 (N=900) | 60-74 (N=931) | 75+ (N=415) |
|---|---|---|---|---|---|
| body mass index, kg/m² | 3272 | 24.86 27.93 31.32; 28.80 ± 6.60 | 25.35 28.05 31.38; 28.78 ± 5.41 | 25.41 28.23 31.81; 28.78 ± 5.10 | 24.21 26.81 29.72; 27.18 ± 4.35 |
| education level : Less than high school | 3289 | 0.25 259/1048 | 0.22 195/898 | 0.38 349/930 | 0.41 171/413 |
| High school | | 0.25 264/1048 | 0.26 233/898 | 0.23 212/930 | 0.22 89/413 |
| More than high school | | 0.50 525/1048 | 0.52 470/898 | 0.40 369/930 | 0.37 153/413 |
| Systolic blood pressure, mmHg | 3164 | 112.0 119.0 129.0; 121.0 ± 12.4 | 115.0 123.0 134.5; 126.2 ± 17.1 | 120.0 131.0 145.0; 133.3 ± 19.9 | 120.0 133.0 147.0; 134.9 ± 21.9 |
| Total cholesterol, mg/dL | 3180 | 177.0 200.0 229.0; 204.0 ± 42.1 | 178.0 204.0 231.0; 206.7 ± 42.8 | 168.0 193.0 222.0; 197.0 ± 44.1 | 158.5 185.0 212.5; 187.2 ± 37.6 |
| HDL cholesterol, mg/dL | 3180 | 39.0 45.0 54.0; 47.8 ± 14.1 | 40.0 47.0 57.0; 49.2 ± 14.1 | 40.0 46.0 56.0; 49.1 ± 13.4 | 41.0 47.0 58.0; 50.7 ± 14.2 |
| smoking status : Never | 3292 | 0.49 518/1048 | 0.39 351/900 | 0.28 262/930 | 0.33 135/414 |
| Former | | 0.19 196/1048 | 0.28 253/900 | 0.51 472/930 | 0.59 246/414 |
| Current | | 0.32 334/1048 | 0.33 296/900 | 0.21 196/930 | 0.08 33/414 |
| alcohol consumption : Moderate Drinker | 3294 | 0.62 649/1048 | 0.59 533/900 | 0.51 477/931 | 0.45 187/415 |
| Non-Drinker | | 0.19 201/1048 | 0.25 226/900 | 0.37 345/931 | 0.46 192/415 |
| Heavy Drinker | | 0.10 104/1048 | 0.10 87/900 | 0.08 71/931 | 0.03 14/415 |
| Missing alcohol | | 0.09 94/1048 | 0.06 54/900 | 0.04 38/931 | 0.05 22/415 |
| diabetes : Yes | 3294 | 0.05 49/1048 | 0.10 90/900 | 0.23 218/931 | 0.16 68/415 |
| congestive heart failure : Yes | 3294 | 0.01 6/1048 | 0.03 28/900 | 0.09 80/931 | 0.11 46/415 |
| cancer : Yes | 3294 | 0.02 17/1048 | 0.04 39/900 | 0.15 137/931 | 0.27 113/415 |
| stroke : Yes | 3294 | 0.00 4/1048 | 0.02 15/900 | 0.07 69/931 | 0.12 48/415 |

For continuous variables, cells show a b c; x ± s, where a is the lower quartile, b the median, and c the upper quartile, and x ± s is the mean ± 1 SD. For categorical variables, cells show the proportion followed by count/total. N is the number of non-missing values.

Summary report by age group for women

Baseline characteristics by age group for women.

| Variable | N | 30-44 (N=1164) | 45-59 (N=924) | 60-74 (N=905) | 75+ (N=393) |
|---|---|---|---|---|---|
| body mass index, kg/m² | 3352 | 23.99 28.00 33.37; 29.34 ± 7.26 | 24.66 29.18 35.06; 30.45 ± 7.67 | 24.92 28.82 33.36; 29.86 ± 6.74 | 23.59 26.82 30.14; 27.25 ± 5.31 |
| education level : Less than high school | 3384 | 0.22 258/1163 | 0.20 186/924 | 0.36 323/905 | 0.40 158/392 |
| High school | | 0.20 237/1163 | 0.24 226/924 | 0.28 250/905 | 0.31 123/392 |
| More than high school | | 0.57 668/1163 | 0.55 512/924 | 0.37 332/905 | 0.28 111/392 |
| Systolic blood pressure, mmHg | 3196 | 104.0 112.0 120.0; 113.2 ± 13.4 | 113.0 123.0 135.0; 125.6 ± 19.5 | 121.0 135.0 150.2; 137.2 ± 22.1 | 131.0 143.0 159.0; 145.9 ± 24.9 |
| Total cholesterol, mg/dL | 3230 | 170.0 195.0 223.0; 199.0 ± 42.5 | 182.0 208.0 232.0; 209.6 ± 41.7 | 187.0 212.0 239.0; 215.0 ± 42.2 | 176.0 204.0 234.0; 207.1 ± 41.8 |
| HDL cholesterol, mg/dL | 3230 | 47.0 57.0 69.0; 59.5 ± 17.3 | 47.0 57.0 70.0; 60.1 ± 17.5 | 49.0 57.5 69.0; 60.1 ± 16.7 | 49.0 60.0 73.0; 62.0 ± 17.2 |
| smoking status : Never | 3384 | 0.64 745/1164 | 0.54 499/923 | 0.56 504/904 | 0.61 239/393 |
| Former | | 0.14 159/1164 | 0.23 216/923 | 0.30 273/904 | 0.33 129/393 |
| Current | | 0.22 260/1164 | 0.23 208/923 | 0.14 127/904 | 0.06 25/393 |
| alcohol consumption : Moderate Drinker | 3386 | 0.57 661/1164 | 0.51 471/924 | 0.39 352/905 | 0.30 119/393 |
| Non-Drinker | | 0.29 339/1164 | 0.34 316/924 | 0.53 484/905 | 0.59 233/393 |
| Heavy Drinker | | 0.04 46/1164 | 0.07 64/924 | 0.03 31/905 | 0.03 12/393 |
| Missing alcohol | | 0.10 118/1164 | 0.08 73/924 | 0.04 38/905 | 0.07 29/393 |
| diabetes : Yes | 3386 | 0.03 40/1164 | 0.13 121/924 | 0.22 195/905 | 0.18 71/393 |
| congestive heart failure : Yes | 3386 | 0.01 8/1164 | 0.02 19/924 | 0.05 42/905 | 0.09 35/393 |
| cancer : Yes | 3386 | 0.04 46/1164 | 0.10 92/924 | 0.14 130/905 | 0.25 98/393 |
| stroke : Yes | 3386 | 0.01 15/1164 | 0.04 33/924 | 0.06 51/905 | 0.10 39/393 |

For continuous variables, cells show a b c; x ± s, where a is the lower quartile, b the median, and c the upper quartile, and x ± s is the mean ± 1 SD. For categorical variables, cells show the proportion followed by count/total. N is the number of non-missing values.

Continuous variables by age and gender

Distribution of systolic blood pressure

Distribution of cholesterol

Distribution of BMI

Distribution of wear time

Physical activity data

Distribution of MVPA

Distribution of MVPA and Total log activity count by time of day

Section session info

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gridExtra_2.3   naniar_0.6.0    corrplot_0.84   gtsummary_1.3.6
##  [5] Hmisc_4.5-0     Formula_1.2-4   survival_3.1-12 lattice_0.20-41
##  [9] plotly_4.9.3    forcats_0.5.1   stringr_1.4.0   dplyr_1.0.4    
## [13] purrr_0.3.4     readr_1.4.0     tidyr_1.1.2     tibble_3.0.6   
## [17] ggplot2_3.3.3   tidyverse_1.3.0 here_1.0.1     
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-148        fs_1.5.0            usethis_2.0.1      
##  [4] lubridate_1.7.9.2   RColorBrewer_1.1-2  httr_1.4.2         
##  [7] rprojroot_2.0.2     tools_4.0.2         backports_1.2.1    
## [10] R6_2.5.0            rpart_4.1-15        mgcv_1.8-31        
## [13] DBI_1.1.1           lazyeval_0.2.2      colorspace_2.0-0   
## [16] nnet_7.3-14         withr_2.4.1         tidyselect_1.1.0   
## [19] compiler_4.0.2      cli_2.3.0           rvest_0.3.6        
## [22] gt_0.2.2            htmlTable_2.1.0     xml2_1.3.2         
## [25] labeling_0.4.2      bookdown_0.21       scales_1.1.1       
## [28] checkmate_2.0.0     digest_0.6.27       foreign_0.8-80     
## [31] rmarkdown_2.6       base64enc_0.1-3     jpeg_0.1-8.1       
## [34] pkgconfig_2.0.3     htmltools_0.5.1.1   dbplyr_2.1.0       
## [37] highr_0.8           htmlwidgets_1.5.3   rlang_0.4.10       
## [40] readxl_1.3.1        rstudioapi_0.13     farver_2.0.3       
## [43] generics_0.1.0      jsonlite_1.7.2      crosstalk_1.1.1    
## [46] magrittr_2.0.1      Matrix_1.2-18       Rcpp_1.0.6         
## [49] munsell_0.5.0       lifecycle_0.2.0     visdat_0.5.3       
## [52] stringi_1.5.3       yaml_2.2.1          grid_4.0.2         
## [55] crayon_1.4.1        haven_2.3.1         splines_4.0.2      
## [58] hms_1.0.0           knitr_1.31          pillar_1.4.7       
## [61] reprex_1.0.0        glue_1.4.2          evaluate_0.14      
## [64] latticeExtra_0.6-29 data.table_1.13.6   broom.helpers_1.1.0
## [67] modelr_0.1.8        png_0.1-7           vctrs_0.3.6        
## [70] rmdformats_1.0.1    cellranger_1.1.0    gtable_0.3.0       
## [73] assertthat_0.2.1    xfun_0.20           broom_0.7.4        
## [76] viridisLite_0.3.0   cluster_2.1.0       ellipsis_0.3.1